SNP Discovery through Next-Generation Sequencing and Its Applications

Hindawi Publishing Corporation
International Journal of Plant Genomics
Volume 2012, Article ID 831460, 15 pages
doi:10.1155/2012/831460

Review Article

Santosh Kumar (1), Travis W. Banks (2), and Sylvie Cloutier (1, 3)

(1) Department of Plant Science, University of Manitoba, Winnipeg, MB, Canada R3T 2N2
(2) Department of Applied Genomics, Vineland Research and Innovation Centre, Vineland Station, ON, Canada L0R 2E0
(3) Cereal Research Centre, Agriculture and Agri-Food Canada, Winnipeg, MB, Canada R3T 2M9

Correspondence should be addressed to Sylvie Cloutier, sylvie.j.cloutier@agr.gc.ca

Received 3 August 2012; Accepted 8 October 2012

Academic Editor: Roberto Tuberosa

Copyright © 2012 Santosh Kumar et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The decreasing cost along with rapid progress in next-generation sequencing and related bioinformatics computing resources has facilitated large-scale discovery of SNPs in various model and nonmodel plant species. Large numbers and genome-wide availability of SNPs make them the marker of choice in partially or completely sequenced genomes. Although excellent reviews have been published on next-generation sequencing, its associated bioinformatics challenges, and the applications of SNPs in genetic studies, a comprehensive review connecting these three intertwined research areas is needed. This paper touches upon various aspects of SNP discovery, highlighting key points in availability and selection of appropriate sequencing platforms, bioinformatics pipelines, SNP filtering criteria, and applications of SNPs in genetic analyses.
The use of next-generation sequencing methodologies in many nonmodel crops leading to discovery and implementation of SNPs in various genetic studies is discussed. Development and improvement of bioinformatics software that are open source and freely available have accelerated SNP discovery while reducing the associated cost. Key considerations for SNP filtering and associated pipelines are discussed in specific topics. A list of commonly used software and their sources is compiled for easy access and reference.

1. Introduction

Molecular markers are widely used in plant genetic research and breeding. Single Nucleotide Polymorphisms (SNPs) are currently the marker of choice due to their large numbers in virtually all populations of individuals. The applications of SNP markers have clearly been demonstrated in human genomics where complete sequencing of the human genome led to the discovery of several million SNPs [1] and technologies to analyze large sets of SNPs (up to 1 million) have been developed. SNPs have been applied in areas as diverse as human forensics [2] and diagnostics [3], aquaculture [4], marker-assisted breeding of dairy cattle [5], crop improvement [6], conservation [7], and resource management in fisheries [8]. Functional genomic studies have capitalized upon SNPs located within regulatory genes, transcripts, and Expressed Sequence Tags (ESTs) [9, 10]. Until recently, large-scale SNP discovery in plants was limited to maize, Arabidopsis, and rice [11–15]. Genetic applications such as linkage mapping, population structure, association studies, map-based cloning, marker-assisted plant breeding, and functional genomics continue to be enabled by access to large collections of SNPs. Arabidopsis thaliana was the first plant genome sequenced [16], followed soon after by rice [17, 18]. In the year 2011 alone, the number of plant genomes sequenced doubled as compared to the number sequenced in the previous decade, resulting in 31 publicly released sequenced plant genomes to date, and counting (http://www.phytozome.net/). With the ever increasing throughput of next-generation sequencing (NGS), de novo and reference-based SNP discovery and application are now feasible for numerous plant species.

Sequencing refers to the identification of the nucleotides in a polymer of nucleic acids, whether DNA or RNA. Since its inception in 1977, sequencing has brought about the field of genomics and increased our understanding of the organization and composition of plant genomes. Tremendous improvements in sequencing have led to the generation of large amounts of DNA information in a very short period of time [19]. The analyses of large volumes of data generated through various NGS platforms require powerful computers and complex algorithms and have led to a recent expansion of the bioinformatics field of research. This review focuses on the a priori discovery of SNPs through NGS, bioinformatics tools and resources, and the various downstream applications of SNPs.

2. History and Evolution of Sequencing Technologies

2.1. Invention of Sequencing. In 1977, two sequencing methods were developed and published. The Sanger method is a sequencing-by-synthesis (SBS) method that relies on a combination of deoxy- and dideoxy-labeled chain terminator nucleotides [20]. The first complete genome sequencing, that of bacteriophage phi X174, was achieved that same year using this pioneering method [21]. The method based on chemical modification followed by cleavage at specific sites, also published in 1977 [22], quickly became the less favored of the two because of its technical complexities, use of hazardous chemicals, and inherent difficulty in scale-up. In contrast, the Sanger method, for which Frederick Sanger was awarded his second Nobel Prize in chemistry in 1980, was quickly adopted by the biotechnology industry, which implemented it using a broad array of chemistries and detection methods [19].

2.2. Sequencing Technologies. In the last decade, new sequencing technologies have outperformed Sanger-based sequencing in throughput and overall cost, if not quite in sequence length and error rate [23]. This section will focus on the three main NGS platforms as well as the two main third-generation sequencing (TGS) platforms, their throughput and relative cost. We made every effort to ensure the accuracy of the data at the time of submission. However, the cost and throughput of these sequencing platforms change rapidly and, as such, our analysis only represents a snapshot in time. The flux of innovation in this field imposes a need for constant assessment of the technologies’ potentials and realignment of research goals.

2.2.1. Roche (454) Sequencing. Pyrosequencing was the first of the new highly parallel sequencing technologies to reach the market [24]. It is commonly referred to as 454 sequencing after the name of the company that first commercialized it. It is an SBS method where single fragments of DNA are hybridized to a capture bead array and the beads are emulsified with the reagents necessary to PCR amplify the individually bound template. Each bead in the emulsion acts as an independent PCR where millions of copies of the original template are produced and bound to the capture beads, which then serve as the templates for the subsequent sequencing reaction. The individual beads are deposited into a picotiter plate along with DNA polymerase, primers, and the enzymes necessary to create fluorescence through the consumption of inorganic phosphate produced during sequencing. The instrument washes the picotiter plate with each of the DNA bases in turn. As template-specific incorporation of a base by DNA polymerase occurs, a pyrophosphate (PPi) is produced. This pyrophosphate is detected by an enzymatic luminometric inorganic pyrophosphate detection assay (ELIDA) through the generation of a light signal following the conversion of PPi into ATP [25]. Thus, the wells in which the current nucleotides are being incorporated by the sequencing reaction occurring on the bead emit a light signal proportional to the number of nucleotides incorporated, whereas wells in which the nucleotides are not being incorporated do not. The instrument repeats the sequential nucleotide wash cycle hundreds of times to lengthen the sequences. The 454 GS FLX Titanium XL platform currently generates up to 700 MB of raw 750 bp reads in a 23 hour run. The technology has difficulty quantifying homopolymers, resulting in insertions/deletions, and has an overall error rate of approximately 1%. Reagent costs are approximately $6,200 per run [26].

2.2.2. Illumina Sequencing. Illumina technology, acquired by Illumina from Solexa, followed the release of 454 sequencing. With this sequencing approach, fragments of DNA are hybridized to a solid substrate called a flow cell. In a process called bridge amplification, the bound DNA template fragments are amplified in an isothermal reaction where copies of the template are created in close proximity to the original. This results in clusters of DNA fragments on the flow cell creating a “lawn” of bound single strand DNA molecules. The molecules are sequenced by flooding the flow cell with a new class of cleavable fluorescent nucleotides and the reagents necessary for DNA polymerization [27]. A complementary strand of each template is synthesized one base at a time using fluorescently labeled nucleotides. The fluorescent molecule is excited by a laser and emits light, the colour of which is different for each of the four bases. The fluorescent label is then cleaved off and a new round of polymerization occurs. Unlike 454 sequencing, all four bases are present for the polymerization step and only a single molecule is incorporated per cycle. The flagship HiSeq2500 sequencing instrument from Illumina can generate up to 600 GB per run with a read length of 100 nt and 0.1% error rate. The Illumina technique can generate sequence from opposite ends of a DNA fragment, so-called paired-end (PE) reads. Reagent costs are approximately $23,500 per run [26].

2.2.3. Applied Biosystems (SOLiD) Sequencing. The SOLiD system was jointly developed by the Harvard Medical School and the Howard Hughes Medical Institute [28]. The library preparation in SOLiD is very similar to Roche/454 in that clonal bead populations are prepared in microreactors containing DNA template, beads, primers, and PCR components. Beads that contain PCR products amplified by emulsion PCR are enriched by a proprietary process. The DNA templates on the beads are modified at their 3′ end to allow attachment to glass slides. A primer is annealed to an adapter on the DNA template and a mixture of fluorescently tagged oligonucleotides is pumped into the flow cell. When the oligonucleotide matches the template sequence, it is ligated onto the primer and the unincorporated nucleotides are washed away. A charge-coupled device (CCD) camera captures the different colours attached to the primer. Each fluorescence wavelength corresponds to a particular dinucleotide combination. After image capture, the fluorescent tag is removed and a new set of oligonucleotides is injected into the flow cell to begin the next round of DNA ligation [19]. On the SOLiD 5500xl platform, this sequencing-by-ligation method generates up to 1,410 million PE reads of 75 + 35 nt each with an error rate of 0.01% and a reagent cost of approximately $10,500 per run [26].

Although widely accepted and used, the NGS platforms suffer from amplification biases introduced by PCR and dephasing due to varying extension of templates. The TGS technologies use single molecule sequencing, which eliminates the need for prior amplification of DNA, thus overcoming the limitations imposed by NGS. The advantages offered by TGS technology are (i) lower cost, (ii) high throughput, (iii) faster turnaround, and (iv) longer reads [19, 29]. The TGS can broadly be classified into three different categories: (i) SBS where individual nucleotides are observed as they incorporate (Pacific Biosciences single molecule real time (SMRT), Heliscope true single molecule sequencing (tSMS), and Life Technologies/Starlight and Ion Torrent), (ii) nanopore sequencing where single nucleotides are detected as they pass through a nanopore (Oxford/Nanopore), and (iii) direct imaging of individual molecules (IBM).

2.2.4. Helicos Biosciences Corporation (Heliscope) Sequencing. Heliscope sequencing involves DNA library preparation and DNA shearing followed by addition of a poly-A tail to the sheared DNA fragments. These poly-A tailed DNA fragments are attached to flow cells through poly-T anchors. The sequencing proceeds by DNA extension with one of four fluorescently tagged nucleotides incorporated, followed by detection by the Heliscope sequencer. The fluorescent tag on the incorporated nucleotide is then chemically cleaved to allow subsequent elongation of DNA [30]. Heliscope sequencers can generate up to 28 GB of sequence data per run (50 channels) with a maximum read length of 55 bp at ∼99% accuracy [31]. The cost per run per channel is approximately $360.

2.2.5. Pacific Biosciences SMRT Sequencing. The Pacific Biosciences sequencer uses glass-anchored DNA polymerases which are housed at the bottom of a zero-mode waveguide (ZMW). DNA fragments are added into the ZMW chamber with the anchored DNA polymerase, and nucleotides, each labeled with a different colour fluorophore, are diffused from above the ZMW. As the nucleotides circulate through the ZMW, only the incorporated nucleotides remain at the bottom of the ZMW while unincorporated nucleotides diffuse back above the ZMW. A laser placed below the ZMW excites only the fluorophores of the incorporated nucleotides as the ZMW entraps the light and does not allow it to reach the unincorporated nucleotides above [32]. The Pacific Biosciences sequencers can generate up to 140 MB of sequences per run (per SMRT cell) with reads of 2.5 Kbp at ∼85% accuracy. The cost per run per SMRT cell is approximately $600.

Among the TGS technologies, Pacific Biosciences SMRT and Heliscope tSMS have been used in characterizing bacterial genomes and in human-disease-related studies [31]; however, TGS has yet to be capitalized upon in plant genomes. The Heliscope generates short reads (55 bp) which may cause ambiguous read mapping due to the presence of paralogous sequences and repetitive elements in plant genomes. The Pacific Biosciences reads have high error rates which limit their direct use in SNP discovery. However, their long reads offer a definite advantage to fill gaps in genomic sequences and, at least in bacterial genomes, NGS reads have proven capable of “correcting” the base call errors of this TGS technology [33–36]. Hybrid assemblies incorporating short (Illumina, SOLiD), medium (454/Roche), and long reads (Pac-Bio) have the potential to yield better quality reference genomes and, as such, would provide an improved tool for SNP discovery.

The choice of a sequencing strategy must take into account the research goals, ability to store and analyze data, the ongoing changes in performance parameters, and the cost of NGS/TGS platforms. Some key considerations include cost per raw base, cost per consensus base, raw and consensus accuracy of bases, read length, cost per read, and availability of PE or single end reads. The pre- and postprocessing protocols such as library construction [37] and pipeline development and implementation for data analysis [38] are also important.

2.3. RNA and ChIP Sequencing. Genome-wide analyses of RNA sequences and their qualitative and quantitative measurements provide insights into the complex nature of regulatory networks. RNA sequencing has been performed on a number of plant species including Arabidopsis [39], soybean [40], rice [41], and maize [42] for transcript profiling and detection of splice variants. RNA sequencing has been used in de novo assemblies followed by SNP discovery performed in nonmodel plants such as Eucalyptus grandis [43], Brassica napus [44], and Medicago sativa [45]. RNA deep-sequencing technologies such as digital gene expression [46] and Illumina RNASeq [47] are both qualitative and quantitative in nature and permit the identification of rare transcripts and splice variants [48]. RNA may be sequenced following its conversion into cDNA. This method is, however, prone to error due to (i) the inefficient nature of reverse transcriptases (RTs) [49], (ii) DNA-dependent DNA polymerase activity of RT causing spurious second strand DNA [50], and (iii) artifactual cDNA synthesis due to template switching [51]. Direct RNA sequencing (DRS) developed by Helicos Biosciences Corporation is a high throughput and cost-effective method which eliminates the need for cDNA synthesis and ligation/amplification, leading to improved accuracy [52].

Chromatin immunoprecipitation (ChIP) is a specialized sequencing method that was specifically designed to identify DNA sequences involved in in vivo protein-DNA interactions [53]. ChIP-sequencing (ChIP-Seq) is used to map the binding sites of transcription factors and other DNA binding sites for proteins such as histones. As such, ChIP-Seq does not aid SNP discovery, but the availability of SNP data along with ChIP-Seq allows the study of allele-specific states of chromatin organization. Deep sequence coverage leading to dense SNP maps permits the identification of transcription factor binding sites and histone-mediated epigenetic modifications [54]. ChIP-Seq can be performed on serial analysis of gene expression (SAGE) tags or PE reads using Sanger, 454, and Illumina platforms [55, 56].

The DNA, RNA, and ChIP-Seq data are analysed using a reference sequence if available or, in the absence of such a reference, require de novo assembly, all of which is performed using specialized software, algorithms, pipelines, and hardware.

3. Computing Resources for Sequence Assembly

The next-generation platforms generate a considerable amount of data and the impact of this with respect to data storage and processing time can be overlooked when designing an experiment. Bioinformatics research is constantly developing new software and algorithms, data storage approaches, and even new computer architectures to better meet the computation requirements for projects incorporating NGS. This section describes the state of the art with respect to software for NGS alignment and analysis at the time of writing.

3.1. Software for Sequence Analysis. Both commercial and noncommercial sequence analysis software are available for Windows, Macintosh, and Linux operating systems. NGS companies offer proprietary software such as consensus assessment of sequence and variation (CASAVA) for Illumina data and Newbler for 454 data. Such software tend to be optimized for their respective platform but have limited cross applicability to the others. Web-based portals such as Galaxy [57] are tailored to a multitude of analyses, but the requirement to transfer multigigabyte sequence files across the internet can limit their usability to smaller datasets. Commercially available software such as CLC-Bio (http://www.clcbio.com/) and SeqMan NGen (http://www.dnastar.com/t-sub-products-genomics-seqman-ngen.aspx) provide a friendly user interface, are compatible with different operating systems, require minimal computing knowledge, and are capable of performing multiple downstream analyses. However, they tend to be relatively expensive, have narrow customizability, and require locally available high computing power. A recent review by Wang et al. [58] recommends Linux-based programs because they are often free, not specific to any sequencing platform, and less computing power hungry and, as a consequence, tend to perform faster. Flexibility in the choice of parameters for read assembly is another major advantage. However, most biologists are unfamiliar with the Linux operating system, its structure and command lines, which imposes a steep learning curve for adoption. Linux-based software such as Bowtie [59], BWA [60], and SOAP2/3 [61] have been used widely for the analysis of NGS data. Other software may not have gained broad acceptance but may have unique features worth noting. For reviews on NGS software, see Li and Homer [62], Wang et al. [58], and Treangen and Salzberg [63]. Characteristics of the most common NGS software and their attributes are listed in Table 1, and their download information can be found in Table 4.

3.2. Consideration for Software Selection. In selecting software for NGS data analysis one must consider, among other things, the sequencing platform, the availability of a reference genome, the computing and storage resources necessary, and the bioinformatics expertise available. Algorithms used for sequence analysis have matured significantly but may still require computing power beyond what is currently available in most genomics facilities and/or long processing time. For example, in aligning 2 × 13,326,195 paired-end reads (76 bp) from The Cancer Genome Atlas project (SRR018643) [64], SHRiMP [65] took 1,065 hrs with a peak memory footprint of 12 gigabytes to map 81% of the reads to the human genome reference, whereas Bowtie used 2.9 gigabytes of memory and a run time of 2.2 hrs but only achieved a 67% mapping rate [58]. Both time and memory become critical when dealing with a very large NGS dataset. Fast and memory efficient sequence mapping seems to be preferred over slower, memory demanding software even at the cost of a reduced mapping rate. It should be noted that a higher percentage of mapped reads is not a strict measure of quality because it may be indicative of a higher level of misaligned reads or reads aligned against repetitive elements, features that are not desirable [63].

In the absence of a reference genome, de novo assembly of a plant genome is achieved using sequence information obtained through a combination of Sanger and/or NGS of bacterial artificial chromosome (BAC) clones, or by whole genome shotgun (WGS) with NGS [66]. De novo assemblies are time consuming and require much greater computing power than read mapping onto a reference genome. The assembly accuracy depends in part on the read length and depth as well as the nature of the sequenced genome. The genomes of Arabidopsis thaliana [16], rice [67], and maize [68] were generated using a BAC-by-BAC approach while poplar [69], grape [70], and sorghum [71] genomic sequences were obtained through WGS. All genomes sequenced to date are fragmented to varying degrees because of the inability of sequencing technologies and bioinformatics algorithms to assemble through highly conserved repetitive elements. A list of current plant genome sequencing projects, their sequencing strategies, and status from standard draft to finished can be found in the review by Feuillet et al. [72].

Software programs such as Mira [73], SOAPdenovo [74], ABySS [75], and Velvet [76] have been used for de novo assembly. MIRA is well documented and can be readily customized, but it requires substantial computing memory and is not suited for large complex genomes. Of the freely available software, SOAPdenovo is one of the fastest read assembly programs and it uses a comparatively moderate amount of computing memory. The assembly generated by SOAPdenovo can be used for SNP discovery using SOAPsnp as implemented for the apple genome [77]. ABySS can be deployed on a computer cluster. It requires the least amount of memory and can be used for large genomes. Velvet requires the largest amount of memory. It can use mate-pair information to resolve and correct assembly errors.

Table 1: List of most cited/used software for sequence assembly of NGS data. Source locations for these software are compiled in Table 4. Color space, read length, gapped alignment, and paired-end are the supported parameters.

Name (current version) | Assembly type (algorithm) | Color space | Read length | Gapped alignment | Paired-end | Output format | Platform
CLC-Bio | Reference | Yes | Arbitrary | Yes | Yes | CLC-Bio | Linux/Windows/Mac OS X
SeqMan NGen | Reference | Yes | Arbitrary | Yes | Yes | ACE, BAM | Windows/Mac OS X
NextGENe | Reference | Yes | Arbitrary | Yes | Yes | NextGENe | Windows/Mac OS X
Bowtie (2) | Reference (FM-index) | Yes | Arbitrary | Yes | Yes | SAM | Linux/Windows/Mac OS X
BWA | Reference (FM-index) | Yes | Arbitrary | Yes | Yes | SAM | Linux
SOAP (3) | Reference (FM-index) | Yes | Arbitrary | No | Yes | SOAP2/3 | Linux
MAQ (0.6.6) | Reference (Hashing reads) | Yes | ≤127 | Yes | Yes | MAQ | Linux/Solaris/Mac OS X
Novoalign (2.07.07) | Reference (Hashing reference) | Yes | Arbitrary | Yes | Yes | SAM | Linux/Mac OS X
Mosaik (1.1.0018) | Reference (Hashing reference) | Yes | Arbitrary | Yes | Yes | SAM | Linux/Windows/Mac OS X/Solaris
SHRiMP (2.2.2) | Reference (Hashing reference) | Yes | Arbitrary | Yes | Yes | SAM | Linux/Mac OS X
Mira (3.4) | Reference | Yes | Arbitrary | Yes | Yes | FASTA, ACE | Linux

(1) Commercial software. (2) Option for de novo assembly and modules included for variant calling.

Figure 1: Graphical user interface of Tablet, an assembly visualization program, displays the reference genome on top and the mapped reads with color-coded SNPs on the bottom.

4. SNP Discovery

The most common application of NGS is SNP discovery, whose downstream usefulness in linkage map construction, genetic diversity analyses, association mapping, and marker-assisted selection has been demonstrated in several species [78]. NGS-derived SNPs have been reported in humans [79], Drosophila [80], wheat [81, 82], eggplant [83], rice [84–86], Arabidopsis [87, 88], barley [14, 89], sorghum [90], cotton [91], common beans [78], soybean [92], potato [93], flax [94], Aegilops tauschii [95], alfalfa [96], oat [97], and maize [98], to name a few.

SNP discovery using NGS is readily accomplished in small plant genomes for which good reference genomes are available, such as rice and Arabidopsis [86, 99]. Although SNP discovery in complex genomes without a reference genome such as wheat [81, 82], barley [14, 89], oat [97], and beans [78] can be achieved through NGS, several challenges remain in other nonmodel but economically important crops. The presence of repeat elements, paralogs, and incomplete or inaccurate reference genome sequences can create ambiguities in SNP calling [63]. NGS read mapping can also suffer from sequencing error (erroneous base calling) and misaligned reads. The following section focuses on programs tailored for SNP discovery and emphasizes some of the precautions and considerations to minimize erroneous SNP calling.

4.1. Software and Pipelines for SNP Discovery. In theory, an SNP is identified when a nucleotide from an accession read differs from the reference genome at the same nucleotide position. In the absence of a reference genome, this is achieved by comparing reads from different genotypes using de novo assembly strategies [95]. Read assembly files generated by mapping programs are used to perform SNP calling. In practice, various empirical and statistical criteria are used to call SNPs, for example, a minimum and maximum number of reads considering the read depth, the quality score, and the consensus base ratio [95]. Thresholds for these criteria are adjusted based on the read length and the genome coverage achieved by the NGS data. In assemblies generated allowing single nucleotide variants and insertions/deletions (indels), a list of SNP and indel coordinates is generated and the read mapping results can be visualized using graphical user interface programs such as Tablet [100] (Figure 1), SNP-VISTA [101], or Savant [102] (refer to Table 4 for download information). Tablet has a user-friendly interface and is widely used because it supports a wide array of commonly used file formats such as SAM, BAM, SOAP, ACE, FASTQ, and FASTA generated by different read assemblers such as Bowtie, BWA, SOAP, MAQ, and SeqMan NGen. It displays contig overview, coverage information, and read names, and it allows searching for specific coordinates on scaffolds.

Broadly used SNP calling software include Samtools [103], SNVer [104], and SOAPsnp [74]. Samtools is popular because of its various modules for file conversion (SAM to BAM and vice versa), mapping statistics, variant calling, and assembly visualization. Recently, SOAPsnp has gained popularity because of its tight integration with the SOAP aligner and other SOAP modules, which are constantly upgraded and provide a one-stop shop for the sequencing analysis continuum. Variant calling algorithms such as Samtools and SNVer can be used as stand-alone programs or incorporated into pipelines for SNP calling. Reviews of SNP calling software have been published [63, 105]. Some of the main features of the current commonly used software are listed in Table 2 (refer to Table 4 for download information).

4.2. SNP Discovery from Multiple Individuals and Complex Genomes. SNP discovery is more robust when multiple and divergent genotypes are used simultaneously, creating the necessary basis to capture the genetic variability of a species. Large parts of plant genomes consist of repetitive elements [106] which can cause spurious SNP calling by erroneous read mapping to paralogous repeat element sequences. In polyploid genomes such as cotton (allotetraploid), homoeologous sequences can cause similar misalignment [91]. Improved read assembly and filtering of SNPs become even more important factors for accurate SNP calling in these cases because they can mitigate the effects of errors caused by paralogs and homoeologs.

Read assembly algorithms such as Bowtie and SOAP as well as variant calling/genotyping software such as GATK [107] are rapidly evolving to accommodate an ever increasing number of reads, increased read length, nucleotide quality values, and mate-pair information of PE reads. Assembly programs such as Novoalign (http://www.novocraft.com/main/index.php) and STAMPY [108], although memory and time intensive, are highly sensitive for simultaneous mapping of short reads from multiple individuals [105].

Table 2: Commonly used NGS variant calling software. Download information for these software is compiled in Table 4. A more comprehensive list of variant calling programs is available at http://seqanswers.com/wiki/Software/list.

Software | Multisample support | Reference | Features | Platform
Samtools | Yes | Aligned reads | Include computation of genotype likelihoods and variant calling | Linux
SOAPsnp | No | Variant database | Part of SOAP3 for variant calling | Linux
GATK | Yes | Aligned reads | Include variant caller, SNP filter, and SNP quality calibrator | Linux
SNVer | Yes | Aligned reads | Fast variant caller, assigning SNP significance based on read depth | Windows, Linux, Mac OS X
SHORE | Yes | Aligned reads | Variant calling based on reference sequence even from other species | Linux, Mac OS X
MaCH | Yes | Genotype likelihoods | Variant calling with or without LD information | Windows, Linux, Mac OS X
IMPUTE2 | Yes | Candidate SNPs and genotype likelihoods | Variant calling and linkage map-based SNP imputation | Windows, Linux, Mac OS X

SNP calls can be significantly improved using filtering criteria that are specific to the genome characteristics and the dataset. For instance, projects aimed at resequencing can compare different datasets from the same genotype and thus eliminate data with large discrepancies. This strategy identifies the most common sources of error and is applied in the 1000 genome project [109]. Reduced representation libraries (RRLs), that is, sequencing an enriched subset of a genome by eliminating a proportion of its repetitive fractions [79], reduce the probability of misalignments to repeats and thus potential downstream erroneous SNP calling. Filtering criteria that can improve SNP accuracy include (i) a minimum read depth (often ≥3 per genotype), (ii) >90% nucleotides within a genotype having identical call at a given position (∼<10% sequencing error), (iii) a read depth ≤ mean of the sequence depth over the entire mapping assembly, (iv) the elimination of ribosomal DNA and other repetitive elements in the 50 nt flanking any SNP call, and (v) masking of homopolymer SNPs with a given base string length (often ≥2). Additionally, in polyploid species, separate assembly of homoeologs using stringent mapping parameters is often essential for genome-wide SNP identification to avoid spurious SNP calls caused by erroneous homoeologous read mapping [91].

4.3. SNP Validation. Prior to any SNP applications, the discovered SNPs must be validated to identify the true SNPs and get an idea of the percentage of potentially false SNPs resulting from an SNP discovery exercise. The need for validation arises because a proportion of the discovered SNPs could have been wrongly called for various reasons including those outlined above. SNP validation can be accomplished using a variety of material such as a biparental segregating population or a diverse panel of genotypes. Usually a small subset of the SNPs is used for validation through assays such as the Illumina Goldengate [110], the KBiosciences Competitive Allele-Specific PCR SNP genotyping system (KASPar) (http://www.lgcgenomics.com/), or High Resolution Melting (HRM) curve analysis. Validation can serve as an iterative and informative process to modify and optimize the SNP filtering criteria to improve SNP calling. For example, a subset of 144 SNPs from a total of 2,113,120 SNPs was validated using the Goldengate assay on 160 accessions in apple [77]. Another example is illustrated in Figure 2 where a KASPar assay was performed on 92 genotypes from a segregating population, illustrating the validation of a single “T/C” SNP in two distinct clusters. Other validation strategies used in nonmodel organisms are tabulated in Garvin et al. [111]. With the continuously competitive pricing of NGS, genotyping-by-sequencing (GBS) is becoming a viable SNP validation method. Either biparental segregating populations or a collection of diverse genotypes can be sequenced at a reasonable cost using indexing, that is, combining multiple independently tagged genotypes in a single NGS run to obtain genome-wide or reduced representation genome sequences at a lower coverage but potentially validating a much larger number of SNPs than the methods described above. Sequencing of segregating populations or diverse genotypes may also lead to the discovery of additional SNPs.

The two major factors affecting the SNP validation rate are sequencing and read mapping errors as discussed above. NGS platforms have different levels of sequencing accuracies, and this may be the most important factor determining the variation in the validation rate, from 88.2% for SOLiD followed by Illumina at 85.4% and Roche 454 at 71% [95]. The SNP validation rates can be improved using RRL for SNP discovery and choosing SNPs within the nonrepetitive sequences including predicted single copy genes and single copy repeat junctions shown to have high validation rates [95].

5. SNP Genotyping

SNP genotyping is the downstream application of SNP discovery to identify genetic variations. SNP applications include phylogenetic analysis, marker-assisted selection, genetic mapping of quantitative trait loci (QTL), bulked segregant analysis, genomic selection, and genome-wide association studies (GWAS). The number of SNPs and [...] but they are species-specific, expensive to design, and require specific equipment and chemistry. PCR and primer extension technologies like KASPar and Taqman (http://www.lifetechnologies.com/global/en/home.html) are limited by their low SNP throughput but can be useful to assay a large number of genotypes with few SNPs. NGS technologies have become viable for genotyping studies and may offer advantages over other genotyping methods in cost and efficiency.

5.1. Genotyping-by-Sequencing (GBS).
Therehavebeena number of approaches developed that use complexity reduc- tion strategies to lower the cost and simplify the discovery of SNP markers using NGS, RNA-Seq, complexity reduction of polymorphic sequences (CRoPS), restriction-site-associated DNA sequencing (RAD-Seq), and GBS [118]. Of these Figure 2: Validation of a T/C SNP by a KASPar assay (KBiosciences, methodologies GBS holds the greatest promise to serve Herts, England). Genotypes with a “T” are represented by black the widest base of plant researchers because of its ability dots with a white cross clustered in the upper left and those with a “C” by white dots with a black cross in the bottom right cluster. to allow simultaneous marker discovery and genotyping The two black dots near the bottom left are negative controls. No with low cost and a simple molecular biology workflow. heterozygous individuals were present in this population. Briefly, GBS involves digesting the genome of each individual in a population to be studied with a restriction enzyme [119]. One unique and one common adapter are ligated to the fragments and a PCR is carried out which is biased individuals to screen are of primary importance in choosing an SNP genotyping assay, though cost of the assay and/or towards amplifying smaller DNA fragments. The resulting equipment and the level of accuracy are also important PCR products are then pooled and sequenced using an considerations. Illumina platform. The amplicons are not fragmented so Illumina Goldengate is a commonly used genotyping only the ends of the PCR products are sequenced. The assay because of its flexibility in interrogating 96 to 3,072 unique adapter acts as an ID tag so sequencing reads can be SNP loci simultaneously (http://www.illumina.com/). HRM associated with an individual. The technique can be applied analysis is suitable for a few to an intermediate number to species with or without a reference genome. 
The choice of of SNPs and can be performed within a typical laboratory enzyme has an effect on the number of markers identified setting. KASPar and SNPline genotyping systems (http:// and the amount of sequence coverage required. The more www.lgcgenomics.com/) can be used for genotyping a frequent the restriction recognition site, the higher the few to thousands of SNPs in a laboratory setting. The number of fragments and therefore more potential markers. SNPline system is available in SNPlite or SNPline XL Use of more frequent cutters may necessitate greater amounts versions to allow flexibility in sample number and SNP of sequencing depending on the application. Poland et al. assays. The iPLEX Gold technology developed by Sequenom [120] recently demonstrated the use of two restriction (http://www.sequenom.com/) is based on the MassARRAY enzymes to perform GBS in bread wheat, a hexaploid system which uses primer extension chemistry and matrix- genome. assisted laser desorption/ionisation-time of flight (MALDI- GBS has the potential to be a truly revolutionary TOF) mass spectrometry for genotyping. technology in the arena of plant genomics. It brings high The iPLEX Gold system has gained acceptance due density genotyping to the vast majority of plant species to its high precision and cost-effective implementation. that, until now, have had almost no investment in genomics High throughput chip-based genotyping assays such as resources. With little capital investment requirement and the Affymetrix GeneChip arrays (http://www.affymetrix an affordable per sample cost, all plant researchers now .com/estore/) and the Illumina BeadChips (http://www.illu- have powerful genomic and genetic methodologies available mina.com/) are capable of validating up to a million SNPs to them. Uses of GBS include applications in marker per reaction across an entire genome. 
Detailed analyses of discovery, phylogenetics, bulked segregant analysis, QTL SNP genotyping assays and their features are reviewed in mapping in biparental lines, GWAS, and genome selection. Tsuchihashi and Dracopoli [112], Sobrino and Carracedo GBS can also be applied to fine mapping in candidate gene [113], Giancola et al. [114], Kim and Misra [115], Gupta discovery and be used to generate high-density SNP genetic et al. [116], and Ragoussis [117]. A list of the most commonly maps to assist in de novo genome assembly. We predict used genotyping assays describing the assay type, technology, tremendous advances in functional genomics and plant throughput, multiplexing ability, and relative scalability can be found in Table 3. breeding from the implementation of GBS because it is truly Array-based technologies such as Infinium and Gold- a democratizing application for NGS in nonmodel plant engate substantially improved SNP genotyping efficiency, systems. International Journal of Plant Genomics 9 Table 3: Commonly used genotyping platforms. Throughput Relative scale Name Assay type Technology Multiplexing (samples) (no. of SNP/no. of individuals) Oligo nucleotide Genechip Hybridization 96/5 days Up to 18 × 10 Small/large array Infinium II Hybridization Bead array Up to 128/5 days Up to 13 × 10 Large/small-large Goldengate Primer extension-ligation Bead array 172/3 days Up to 3,072 Medium/large Mass spectrometry iPlex Primer extension 3840/2.5 days Up to 40 Medium/large (MALDI-TOF) Taqman PCR Taqman probe Up to 1536/day Up to 256 Medium/medium Capillary Up to 1536/3 SNPlex PCR Up to 48 Medium/large electrophoresis days FRET quenching KASPar PCR Up to 96/day — Medium/large oligos Primer FRET quenching Invader annealing/endonuclease Up to 384/day Up to 200,000 Medium/large oligos digestion HRM PCR Melting curve analysis Up to 1536/day — Medium/large 6. Applications of SNPS maize [128] are practical examples of gene discovery through SNP-based genetic maps. 
NGS and SNP genotyping technologies have made SNPs the most widely used marker for genetic studies in plant species 6.2. Genome-Wide Association Mapping. Association map- such as Arabidopsis [121]and rice [122]. SNPs can help to ping (AM) panels provide a better resolution, consider decipher breeding pedigree, to identify genomic divergence numerous alleles, and may provide faster marker-trait associ- of species to elucidate speciation and evolution, and to ation than biparental populations [129]. AM, often referred associate genomic variations to phenotypic traits [85]. The to as linkage disequilibrium (LD) mapping, relies on the ease of SNP development, reasonable genotyping costs, and nonrandom association between markers and traits [130]. the sheer number of SNPs present within a collection of LD can vary greatly across a genome. In low LD regions, individuals allow an assortment of applications that can have high marker saturation is required to detect marker-trait a tremendous impact on basic and applied research in plant association, hence the need for densely saturated maps. In species. general, GWASs require 10,000–100,000 markers applied to a collection of genotypes representing a broad genetic basis 6.1. SNPs in Genetic Mapping. A genetic map refers to [130]. the arrangement of traits, genes, and markers relative to In the past few years, NGS technologies have led to the each other as measured by their recombination frequency. discovery of thousands, even millions of SNPs, and novel Genetic maps are essential tools in molecular breeding for application platforms have made it possible to produce plant genetic improvement as they enable gene localization, genome-wide haplotypes of large numbers of genotypes, map-based cloning, and the identification of QTL [123]. making SNPs the ideal marker for GWASs. 
So far, 951 GWASs SNPs have greatly facilitated the production of much higher have been reported in humans (http://www.bing.com/ density maps than previous marker systems. SNPs discovered search?q=www.genome.gov%2Fgwastudies%2F&src=ie9tr). using RNA-Seq and expressed sequence tags (ESTs) have the In plants, such a study was first reported in Arabidopsis added advantage of being gene specific [124]. Their high for flowering time and pathogen-resistance genes [131]. A abundance and rapidly improving genotyping technologies GWAS performed in rice using ∼3.6 million SNPs identified make SNPs an ideal marker type for generating new genetic genomic regions associated with 14 agronomic traits [132]. maps as well as saturating existing maps created with The genetic structure of northern leaf blight, southern leaf other markers. Most SNPs are biallelic thereby having a blight, and leaf architecture was studied using ∼1.6 million lower polymorphism information content (PIC) value as SNPs in maize [133–135]. SNP-based GWAS was also compared to most other marker types which are often performed on species such as barley for which a reference multiallelic [125]. The limited information associated with genome sequence is not available [136]. Soto-Cerda and their biallelic nature is greatly compensated by their high Cloutier [137] have reviewed the concepts, benefits, and frequency, and a map of 700–900 SNPs has been found to limitations of AM in plants. be equivalent to a map of 300–400 simple sequence repeat (SSR) markers [125]. SNP-based linkage maps have been constructed in many economically important species such as 6.3. Evolutionary Studies. SSRs and mitochondrial DNA have rice [126], cotton [91]and Brassica [127]. The identification been used in evolutionary studies since the early 1990s [138]. 
of candidate genes for flowering time in Brassica [127]and However, the biological inferences from results of these two 10 International Journal of Plant Genomics Table 4: Download information of software used for NGS data. Software Source Bowtie http://bowtie-bio.sourceforge.net/bowtie2/index.shtml BWA http://bio-bwa.sourceforge.net/ SOAP http://soap.genomics.org.cn/soap3.html#down2 MAQ http://sourceforge.net/projects/maq/ Novoalign http://www.novocraft.com/main/index.php CLC-Bio Genomics http://www.clcbio.com/index.php?id=1240 SeqMan NGen http://www.dnastar.com/t-products-seqman-ngen.aspx NextGENe http://softgenetics.com/NextGENe.html Mosaik http://bioinformatics.bc.edu/marthlab/Mosaik SHRiMP http://compbio.cs.toronto.edu/shrimp/ Mira http://sourceforge.net/projects/mira-assembler/files/MIRA/stable/ Cassava http://www.illumina.com/software/genome analyzer software.ilmn Newbler http://www.454.com/products/analysis-software/index.asp Novoalign http://www.novocraft.com/main/downloadpage.php Tablet http://bioinf.scri.ac.uk/tablet/ SNP-VISTA http://genome.lbl.gov/vista/snpvista/ Samtools http://sourceforge.net/projects/samtools/ Savant http://genomesavant.com/savant/download.php SOAPsnp http://soap.genomics.org.cn/soapsnp.html GATK http://www.broadinstitute.org/gsa/wiki/index.php/ The Genome Analysis Toolkit SNver http://snver.sourceforge.net/ MaCH http://www.sph.umich.edu/csg/abecasis/MACH/ IMPUTE2 http://mathgen.stats.ox.ac.uk/impute/impute v2.html# download impute2 MEGA http://www.megasoftware.net/ PHYLIP http://evolution.genetics.washington.edu/phylip.html marker types may be misinterpreted due to homoplasy, a polymorphism that can be detected in the population used. 
phenomenon in which similarity in traits or markers occurs This results in a collection of markers that sample only a due to reasons other than ancestry, such as convergent fraction of the diversity that exists in the species but that are evolution, evolutionary reversal, gene duplication, and hor- nevertheless used to infer relatedness and determine genetic izontal gene transfer [139]. The advantage of SNPs over distance for whole populations. Ideally, a set of SNP markers microsatellites and mitochondrial DNA resides in the fact randomly distributed throughout the genome would be that SNPs represent single base nucleotide substitutions and, developed for each population studied. GBS moves us closer as such, they are less affected by homoplasy because their to this goal by incorporating simultaneous discovery of origin can be explained by mutation models [140]. SNPs SNPs and genotyping of individuals. With this approach have been employed to quantify genetic variation, for indi- genome sample bias remains but can be mitigated by careful vidual identification, to determine parentage relatedness and restriction enzyme selection. population structure [138]. Seed shattering (or loss thereof) has been associated with an SNP through a GWAS aimed at unraveling the evolution of rice that led to its domestication 7. Future Perspectives [141]. SNPs have also been used to study the evolution of genessuchas WAG-2 in wheat [142]. Algorithms such as SNP discovery incontestably made a quantum leap forward neighbor-joining and maximum likelihood implemented in with the advent of NGS technologies and large numbers of the PHYLIP [143]and MEGA [144] software are commonly SNPs are now available from several genomes including large used to generate phylogenetic trees. and complex ones (see Section 4). Unlike model systems such The main advantage of SNPs is unquestionably their as humans and Arabidopsis, SNPs from crop plants remain large numbers. 
As with all marker systems the researcher limited for the time being, but broad access to reasonable must be aware of ascertainment biases that exist in the panel cost NGS promises to rapidly increase the production of of SNPs being used. These biases exist because SNPs are reference genome sequences as well as SNP discovery. Many often developed from examining a small group of individuals issues remain to be addressed, such as the ascertainment bias and selecting the markers that maximize the amount of of popular biparental populations and the low validation International Journal of Plant Genomics 11 rate of some array-based genotyping platforms [145]. The Scandinavian wolf population,” Molecular Ecology, vol. 14, no. 2, pp. 503–511, 2005. area of epigenetic regulation of various genome components [8] C.T.Smith,C.M.Elfstrom, L. W. Seeb,and J. E. Seeb,“Use can be better understood as accurate and deeper sequencing of sequence data from rainbow trout and Atlantic salmon for is achieved. RNA and ChIP-sequencing projects, similar to SNP detection in Pacific salmon,” Molecular Ecology, vol. 14, RNA-Seq in the nonmodel plant sweet cherry to identify no. 13, pp. 4193–4203, 2005. SNPs and haplotypes [146], can be undertaken to study [9] B.N.Chorley,X.Wang, M. R. Campbell,G.S.Pittman, functional genomics. A great deal of knowledge that is still M. A. Noureddine, and D. A. Bell, “Discovery and veri- elusive about the noncoding and repetitive elements can fication of functional single nucleotide polymorphisms in be determined with the next wave of modern and efficient regulatory genomic regions: current and developing tech- sequencing technologies. nologies,” Mutation Research, vol. 659, no. 1-2, pp. 147–157, The first (Sanger) and the second (next) generation sequencing technologies have enabled researchers to char- [10] K. Faber, K. H. Glatting, P. J. Mueller, A. Risch, and A. 
Hotz- Wagenblatt, “Genome-wide prediction of splice-modifying acterize DNA sequence variation, sequence entire genomes, SNPs in human genes using a new analysis pipeline called quantify transcript abundance, and understand mechanisms AASsites,” BMC Bioinformatics, vol. 12, supplement 4, article such as alternative splicing and epigenetic regulation [29]. S2, 2011. Numerous plant genomes are now sequenced at various [11] S. Atwell,Y.S.Huang,B.J.Vilhjalmsson ´ et al., “Genome- levels of completion and many more are underway [72]. The wide association study of 107 phenotypes in Arabidopsis thal- NGS technologies have made SNP discovery affordable even iana inbred lines,” Nature, vol. 465, no. 7298, pp. 627–631, in complex genomes and the technologies themselves have improved tremendously in the past decade. Improvements [12] W. B. Barbazuk,S.J.Emrich, H. D. Chen,L.Li, andP.S. in TGS promise synergies with NGS technologies to further Schnable, “SNP discovery via 454 transcriptome sequenc- assist our understanding of plant genetics and genomics. ing,” Plant Journal, vol. 51, no. 5, pp. 910–918, 2007. NGS has revolutionized genomics-related research, and it is [13] A. Ching, K. S. Caldwell, M. Jung et al., “SNP frequency, our belief that the NGS-enabled discoveries will continue in haplotype structure and linkage disequilibrium in elite maize inbred lines,” BMC Genetics, vol. 3, article 19, 2002. the next decade. [14] T. J. Close, P. R. Bhat, S. Lonardi et al., “Development and im- plementation of high-throughput SNP genotyping in barley,” Acknowledgments BMC Genomics, vol. 10, article 582, 2009. [15] X. Xu, X. Liu, S. Ge et al., “Resequencing 50 accessions of The authors are grateful to Andrzej Walichnowski for help cultivated and wild rice yields markers for identifying agro- with paper editing, Joanne Schiavoni for formatting, and nomically important genes,” Nature Biotechnology, vol. 30, Michael Shillinglaw for figure preparation. This chapter was no. 1, pp. 
105–111, 2012. [16] S. Kaul, H. L. Koo, J. Jenkins et al., “Analysis of the genome written within the scope of the Genome Canada TUFGEN sequence of the flowering plant Arabidopsis thaliana,” Nature, project, and support from all funding partners is gratefully vol. 408, no. 6814, pp. 796–815, 2000. acknowledged. [17] S. A. Goff, D. Ricke, T. H. Lan et al., “A draft sequence of the rice genome (Oryza sativa L. ssp. japonica),” Science, vol. 296, no. 5565, pp. 92–100, 2002. References [18] J. Yu, S. Hu, J. Wang et al., “A draft sequence of the rice genome (Oryza sativa L. ssp. indica),” Science, vol. 296, no. [1] K. A. Frazer, D. G. Ballinger, D. R. Cox et al., “A second gen- eration human haplotype map of over 3.1 million SNPs,” 5565, pp. 79–92, 2002. [19] J. A. Shendure, G. J. Porreca, and G. M. Church, “Overview Nature, vol. 449, no. 7164, pp. 851–861, 2007. [2] C. H. Brenner and B. S. Weir, “Issues and strategies in of DNA sequencing strategies,” Current Protocols in Molecular Biology, chapter 7, no. 81, pp. 7.1.1–7.1.11, 2008. the DNA identification of World Trade Center victims,” Theoretical Population Biology, vol. 63, no. 3, pp. 173–178, [20] F. Sanger, S. Nicklen, and A. R. Coulson, “DNA sequenc- ing with chain-terminating inhibitors,” Proceedings of the [3] M.I.McCarthy, G. R. Abecasis,L.R.Cardonetal., “Genome- National Academy of Sciences of the United States of America, vol. 74, no. 12, pp. 5463–5467, 1977. wide association studies for complex traits: consensus, uncertainty and challenges,” Nature Reviews Genetics, vol. 9, [21] F. Sanger, G. M. Air, B. G. Barrell et al., “Nucleotide sequence no. 5, pp. 356–369, 2008. of bacteriophage phiX174 DNA,” Nature, vol. 265, no. 5596, pp. 687–695, 1977. [4] Z.J.Liu andJ.F.Cordes, “DNA marker technologies and their applications in aquaculture genetics,” Aquaculture, vol. [22] A. M. Maxam and W. Gilbert, “A new method for sequencing DNA,” Proceedings of the National Academy of Sciences of the 238, no. 1–4, pp. 
1–37, 2004. [5] L.R.Schaeffer, “Strategy for applying genome-wide selection United States of America, vol. 74, no. 2, pp. 560–564, 1977. [23] M. Kircher and J. Kelso, “High-throughput DNA sequen- in dairy cattle,” Journal of Animal Breeding and Genetics, vol. 123, no. 4, pp. 218–223, 2006. cing—concepts and limitations,” BioEssays,vol. 32, no.6,pp. 524–536, 2010. [6] H. Yu, W. Xie, J. Wang et al., “Gains in QTL detection using an ultra-high density SNP map based on population [24] M. Ronaghi, M. Uhlen, ´ and P. Nyren, ´ “A sequencing method based on real-time pyrophosphate,” Science, vol. 281, no. sequencing relative to traditional RFLP/SSR markers,” PLoS ONE, vol. 6, no. 3, Article ID e17595, 2011. 5375, pp. 363–365, 1998. [25] M. Ronaghi, S. Karamohamed, B. Pettersson, M. Uhlen, ´ and [7] J.M.Seddon, H. G. Parker,E.A.Ostrander,and H. Ellegren, “SNPs in ecological and conservation studies: a test in the P. Nyren, “Real-time DNA sequencing using detection of 12 International Journal of Plant Genomics pyrophosphate release,” Analytical Biochemistry, vol. 242, no. [44] M. Trick, Y. Long,J.Meng, andI.Bancroft, “Single nucleotide 1, pp. 84–89, 1996. polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing,” Plant Biotech- [26] T. C. Glenn, “Field guide to next-generation DNA sequen- cers,” Molecular Ecology Resources, vol. 11, no. 5, pp. 759–769, nology Journal, vol. 7, no. 4, pp. 334–346, 2009. 2011. [45] S. S. Yang, Z. J. Tu, F. Cheung et al., “Using RNA-Seq for gene identification, polymorphism detection and transcript [27] G. Turcatti, A. Romieu, M. Fedurco, and A. P. Tairi, “A new class of cleavable fluorescent nucleotides: synthesis and profiling in two alfalfa genotypes with divergent cell wall composition in stems,” BMC Genomics, vol. 12, no. 1, article optimization as reversible terminators for DNA sequencing by synthesis,” Nucleic Acids Research, vol. 36, no. 4, article e25, 199, 2011. 2008. [46] F. Ozsolak, D. T. 
Ting, B. S. Wittner et al., “Amplification-free digital gene expression profiling from minute cell quantities,” [28] J. Shendure,G.J.Porreca,N.B.Reppasetal., “Molecular biology: accurate multiplex polony sequencing of an evolved Nature Methods, vol. 7, no. 8, pp. 619–621, 2010. bacterial genome,” Science, vol. 309, no. 5741, pp. 1728–1732, [47] Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: a revolu- 2005. tionary tool for transcriptomics,” Nature Reviews Genetics, [29] E. E. Schadt, S. Turner, and A. Kasarskis, “A window into vol. 10, no. 1, pp. 57–63, 2009. third-generation sequencing,” Human Molecular Genetics, [48] H. Xu, Y. Gao, and J. Wang, “Transcriptomic analysis of vol. 19, no. 2, pp. R227–R240, 2010. rice (Oryza sativa) developing embryos using the RNA-Seq [30] T. D. Harris, P. R. Buzby, H. Babcock et al., “Single-molecule technique,” PLoS ONE, vol. 7, no. 2, Article ID e30646, 2012. DNA sequencing of a viral genome,” Science, vol. 320, no. [49] J. D. Roberts, B. D. Preston, L. A. Johnston, A. Soni, L. A. 5872, pp. 106–109, 2008. Loeb, and T. A. Kunkel, “Fidelity of two retroviral reverse [31] C. S. Pareek, R. Smoczynski, and A. Tretyn, “Sequencing transcriptases during DNA-dependent DNA synthesis in vitro,” Molecular and Cellular Biology, vol. 9, no. 2, pp. 469– technologies and genome sequencing,” Journal of Applied Genetics, vol. 52, no. 4, pp. 413–435, 2011. 476, 1989. [32] J. Eid, A. Fehr, J. Gray et al., “Real-time DNA sequencing [50] U. Gubler, “Second-strand cDNA synthesis: mRNA frag- mentsasprimers,” Methods in Enzymology, vol. 152, pp. 330– from single polymerase molecules,” Science, vol. 323, no. 5910, pp. 133–138, 2009. 335, 1987. [33] S. Koren, M. C. Schatz, B. P. Walenz et al., “Hybrid error [51] J. Cocquet, A. Chong, G. Zhang, and R. A. Veitia, “Reverse transcriptase template switching and false alternative tran- correction and de novo assembly of single-molecule sequenc- ing reads,” Nature Biotechnology, vol. 30, no. 7, pp. 
693–700, scripts,” Genomics, vol. 88, no. 1, pp. 127–131, 2006. 2012. [52] F. Ozsolak, A. R. Platt, D. R. Jones et al., “Direct RNA sequencing,” Nature, vol. 461, no. 7265, pp. 814–818, 2009. [34] F. Ribeiro, D. Przybylski, S. Yin et al., “Finished bacterial genomes from shotgun sequence data,” Genome Research.In [53] M. J. Solomon, P. L. Larsen, and A. Varshavsky, “Mapping press. protein-DNA interactions in vivo with formaldehyde: evi- dence that histone H4 is retained on a highly transcribed [35] A. Bashir, A. A. Klammer, W. P. Robins et al., “A hybrid approach for the automated finishing of bacterial genomes,” gene,” Cell, vol. 53, no. 6, pp. 937–947, 1988. Nature Biotechnology, vol. 30, no. 7, pp. 701–707, 2012. [54] T. S. Mikkelsen, M. Ku, D. B. Jaffe et al., “Genome-wide maps of chromatin state in pluripotent and lineage-committed [36] X. Zhang, K. W. Davenport, W. Gu et al., “Improving genome assemblies by sequencing PCR products with PacBio,” cells,” Nature, vol. 448, no. 7153, pp. 553–560, 2007. BioTechniques, vol. 53, no. 1, pp. 61–62, 2012. [55] P. Ng,J.J.Tan,H.S.Ooi et al., “Multiplex sequencing of paired-end ditags (MS-PET): a strategy for the ultra-high- [37] P. Kothiyal,S.Cox,J.Ebert,B.J.Aronow, J. H. Greinwald, and H. L. Rehm, “An overview of custom array sequencing,” throughput analysis of transcriptomes and genomes,” Nucleic Acids Research, vol. 34, no. 12, p. e84, 2006. Current Protocols in Human Genetics, no. 61, chapter 7, pp. 7.17.1–17.17.11, 2009. [56] G. Robertson, M. Hirst, M. Bainbridge et al., “Genome- [38] J. D. McPherson, “Next-generation gap,” Nature Methods, wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing,” vol. 6, no. 11, supplement, pp. S2–S5, 2009. Nature Methods, vol. 4, no. 8, pp. 651–657, 2007. [39] A. P. M. Weber, K. L. Weber, K. Carr, C. Wilkerson, and J. B. Ohlrogge, “Sampling the arabidopsis transcriptome with [57] B. Giardine, C. Riemer, R. C. 
Hardison et al., “Galaxy: a platform for interactive large-scale genome analysis,” Genome massively parallel pyrosequencing,” Plant Physiology, vol. 144, no. 1, pp. 32–42, 2007. Research, vol. 15, no. 10, pp. 1451–1455, 2005. [40] M. Libault, A. Farmer, T. Joshi et al., “An integrated [58] W. Wang, Z. Wei, T.-W. Lam, and J. Wang, “Next generation sequencing has lower sequence coverage and poorer SNP- transcriptome atlas of the crop model Glycine max, and its use in comparative analyses in plants,” Plant Journal, vol. 63, detection capability in the regulatory regions,” Scientific no. 1, pp. 86–99, 2010. Reports, vol. 1, article 55, 2011. [41] T. Lu, G. Lu, D. Fan et al., “Function annotation of the rice [59] B. Langmead,C.Trapnell, M. Pop, andS.L.Salzberg, transcriptome at single-nucleotide resolution by RNA-seq,” “Ultrafast and memory-efficient alignment of short DNA Genome Research, vol. 20, no. 9, pp. 1238–1249, 2010. sequences to the human genome,” Genome Biology, vol. 10, [42] W. B. Barbazuk, S. Emrich, and P. S. Schnable, “SNP no. 3, article R25, 2009. mining from maize 454 EST sequences,” Cold Spring Harbor [60] H. Li and R. Durbin, “Fast and accurate short read alignment Protocols. In press. with Burrows-Wheeler transform,” Bioinformatics, vol. 25, [43] E. Novaes,D.R.Drost,W.G.Farmerieetal., “High- no. 14, pp. 1754–1760, 2009. throughput gene and SNP discovery in Eucalyptus grandis,an [61] R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool uncharacterized genome,” BMC Genomics, vol. 9, article 312, for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 2008. 1966–1967, 2009. International Journal of Plant Genomics 13 [62] H. Li and N. Homer, “A survey of sequence alignment Drosophila,” Nature Genetics, vol. 29, no. 4, pp. 475–481, algorithms for next-generation sequencing,” Briefings in 2001. Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, [81] A. M. Allen, G. L. Barker, S. T. Berry et al., “Transcript-spe- 2010. 
cific, single-nucleotide polymorphism discovery and linkage analysis in hexaploid bread wheat (Triticum aestivum L.),” Plant Biotechnology Journal, vol. 9, no. 9, pp. 1086–1099, 2011.
[63] T. J. Treangen and S. L. Salzberg, “Repetitive DNA and next-generation sequencing: computational challenges and solutions,” Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46.
[64] R. McLendon, A. Friedman, D. Bigner et al., “Comprehensive genomic characterization defines human glioblastoma genes and core pathways,” Nature, vol. 455, no. 7216, pp. 1061–1068, 2008.
[65] S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, and M. Brudno, “SHRiMP: accurate mapping of short color-space reads,” PLoS Computational Biology, vol. 5, no. 5, Article ID e1000386, 2009.
[66] S. Rounsley, P. R. Marri, Y. Yu et al., “De novo next generation sequencing of plant genomes,” Rice, vol. 2, no. 1, pp. 35–43.
[67] T. Sasaki, “The map-based sequence of the rice genome,” Nature, vol. 436, no. 7052, pp. 793–800, 2005.
[68] E. Pennisi, “Plant sciences: corn genomics pops wide open,” Science, vol. 319, no. 5868, p. 1333, 2008.
[69] G. A. Tuskan, S. DiFazio, S. Jansson et al., “The genome of black cottonwood, Populus trichocarpa (Torr. & Gray),” Science, vol. 313, no. 5793, pp. 1596–1604, 2006.
[70] O. Jaillon, J. M. Aury, B. Noel et al., “The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla,” Nature, vol. 449, no. 7161, pp. 463–467, 2007.
[71] A. H. Paterson, J. E. Bowers, R. Bruggmann et al., “The Sorghum bicolor genome and the diversification of grasses,” Nature, vol. 457, no. 7229, pp. 551–556, 2009.
[72] C. Feuillet, J. E. Leach, J. Rogers, P. S. Schnable, and K. Eversole, “Crop genome sequencing: lessons and rationales,” Trends in Plant Science, vol. 16, no. 2, pp. 77–88, 2011.
[73] B. Chevreux, T. Pfisterer, B. Drescher et al., “Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs,” Genome Research, vol. 14, no. 6, pp. 1147–1159, 2004.
[74] R. Li, Y. Li, X. Fang et al., “SNP detection for massively parallel whole-genome resequencing,” Genome Research, vol. 19, no. 6, pp. 1124–1132, 2009.
[75] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. M. Jones, and I. Birol, “ABySS: a parallel assembler for short read sequence data,” Genome Research, vol. 19, no. 6, pp. 1117–1123, 2009.
[76] D. R. Zerbino and E. Birney, “Velvet: algorithms for de novo short read assembly using de Bruijn graphs,” Genome Research, vol. 18, no. 5, pp. 821–829, 2008.
[77] D. Chagné, R. N. Crowhurst, M. Troggio et al., “Genome-wide SNP detection, validation, and development of an 8K SNP array for apple,” PLoS ONE, vol. 7, no. 2, Article ID e31745, 2012.
[78] A. J. Cortés, M. C. Chavarro, and M. W. Blair, “SNP marker diversity in common bean (Phaseolus vulgaris L.),” Theoretical and Applied Genetics, vol. 123, no. 5, pp. 827–845, 2011.
[79] D. Altshuler, V. J. Pollara, C. R. Cowles et al., “An SNP map of the human genome generated by reduced representation shotgun sequencing,” Nature, vol. 407, no. 6803, pp. 513–516, 2000.
[80] J. Berger, T. Suzuki, K. A. Senti, J. Stubbs, G. Schaffner, and B. J. Dickson, “Genetic mapping with SNP markers in
[82] D. Trebbi, M. Maccaferri, P. de Heer et al., “High-throughput SNP discovery and genotyping in durum wheat (Triticum durum Desf.),” Theoretical and Applied Genetics, vol. 123, no. 4, pp. 555–569, 2011.
[83] L. Barchi, S. Lanteri, E. Portis et al., “Identification of SNP and SSR markers in eggplant using RAD tag sequencing,” BMC Genomics, vol. 12, article 304, 2011.
[84] F. A. Feltus, J. Wan, S. R. Schulze, J. C. Estill, N. Jiang, and A. H. Paterson, “An SNP resource for rice genetics and breeding based on subspecies Indica and Japonica genome alignments,” Genome Research, vol. 14, no. 9, pp. 1812–1819, 2004.
[85] K. L. McNally, K. L. Childs, R. Bohnert et al., “Genomewide SNP variation reveals relationships among landraces and modern varieties of rice,” Proceedings of the National Academy of Sciences of the United States of America, vol. 106, no. 30, pp. 12273–12278, 2009.
[86] T. Yamamoto, H. Nagasaki, J. I. Yonemaru et al., “Fine definition of the pedigree haplotypes of closely related rice cultivars by means of genome-wide discovery of single-nucleotide polymorphisms,” BMC Genomics, vol. 11, no. 1, article 267, 2010.
[87] G. Jander, S. R. Norris, S. D. Rounsley, D. F. Bush, I. M. Levin, and R. L. Last, “Arabidopsis map-based cloning in the post-genome era,” Plant Physiology, vol. 129, no. 2, pp. 440–450, 2002.
[88] X. Zhang and J. O. Borevitz, “Global analysis of allele-specific expression in Arabidopsis thaliana,” Genetics, vol. 182, no. 4, pp. 943–954, 2009.
[89] R. Waugh, J. L. Jannink, G. J. Muehlbauer, and L. Ramsay, “The emergence of whole genome association scans in barley,” Current Opinion in Plant Biology, vol. 12, no. 2, pp. 218–222, 2009.
[90] J. C. Nelson, S. Wang, Y. Wu et al., “Single-nucleotide polymorphism discovery by high-throughput sequencing in sorghum,” BMC Genomics, vol. 12, article 352, 2011.
[91] R. L. Byers, D. B. Harker, S. M. Yourstone, P. J. Maughan, and J. A. Udall, “Development and mapping of SNP assays in allotetraploid cotton,” Theoretical and Applied Genetics, vol. 124, no. 7, pp. 1201–1214, 2012.
[92] D. L. Hyten, S. B. Cannon, Q. Song et al., “High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence,” BMC Genomics, vol. 11, no. 1, article 38, 2010.
[93] J. P. Hamilton, C. N. Hansey, B. R. Whitty et al., “Single nucleotide polymorphism discovery in elite north American potato germplasm,” BMC Genomics, vol. 12, article 302, 2011.
[94] Y.-B. Fu and G. W. Peterson, “Developing genomic resources in two Linum species via 454 pyrosequencing and genomic reduction,” Molecular Ecology Resources, vol. 12, no. 3, pp. 492–500, 2012.
[95] F. M. You, N. Huo, K. R. Deal et al., “Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence,” BMC Genomics, vol. 12, article 59, 2011.
[96] Y. Han, Y. Kang, I. Torres-Jerez et al., “Genome-wide SNP discovery in tetraploid alfalfa using 454 sequencing and high resolution melting analysis,” BMC Genomics, vol. 12, p. 350, 2011.
[97] R. E. Oliver, G. R. Lazo, J. D. Lutz et al., “Model SNP development for complex genomes based on hexaploid oat using high-throughput 454 sequencing technology,” BMC Genomics, vol. 12, no. 1, article 77, 2011.
[98] E. Jones, W. C. Chu, M. Ayele et al., “Development of single nucleotide polymorphism (SNP) markers for use in commercial maize (Zea mays L.) germplasm,” Molecular Breeding, vol. 24, no. 2, pp. 165–176, 2009.
[99] S. Ossowski, K. Schneeberger, R. M. Clark, C. Lanz, N. Warthmann, and D. Weigel, “Sequencing of natural strains of Arabidopsis thaliana with short reads,” Genome Research, vol. 18, no. 12, pp. 2024–2033, 2008.
[100] I. Milne, M. Bayer, L. Cardle et al., “Tablet-next generation sequence assembly visualization,” Bioinformatics, vol. 26, no. 3, pp. 401–402, 2009.
[101] N. Shah, M. V. Teplitsky, S. Minovitsky et al., “SNP-VISTA: an interactive SNP visualization tool,” BMC Bioinformatics, vol. 6, no. 1, article 292, 2005.
[102] M. Fiume, V. Williams, A. Brook, and M. Brudno, “Savant: genome browser for high-throughput sequencing data,” Bioinformatics, vol. 26, no. 16, Article ID btq332, pp. 1938–1944, 2010.
[103] H. Li, B. Handsaker, A. Wysoker et al., “The sequence alignment/map format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[104] Z. Wei, W. Wang, P. Hu, G. J. Lyon, and H. Hakonarson, “SNVer: a statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data,” Nucleic Acids Research, vol. 39, no. 19, article e132, 2011.
[105] R. Nielsen, J. S. Paul, A. Albrechtsen, and Y. S. Song, “Genotype and SNP calling from next-generation sequencing data,” Nature Reviews Genetics, vol. 12, no. 6, pp. 443–451, 2011.
[106] R. Ragupathy, R. Rathinavelu, and S. Cloutier, “Physical mapping and BAC-end sequence analysis provide initial insights into the flax (Linum usitatissimum L.) genome,” BMC Genomics, vol. 12, article 217, 2011.
[107] A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[108] G. Lunter and M. Goodson, “Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads,” Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[109] R. M. Durbin, “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[110] J. B. Fan, M. S. Chee, and K. L. Gunderson, “Highly parallel genomic assays,” Nature Reviews Genetics, vol. 7, no. 8, pp. 632–644, 2006.
[111] M. R. Garvin, K. Saitoh, and A. J. Gharrett, “Application of single nucleotide polymorphisms to non-model species: a technical review,” Molecular Ecology Resources, vol. 10, no. 6, pp. 915–934, 2010.
[112] Z. Tsuchihashi and N. C. Dracopoli, “Progress in high throughput SNP genotyping methods,” Pharmacogenomics Journal, vol. 2, no. 2, pp. 103–110, 2002.
[113] B. Sobrino and A. Carracedo, “SNP typing in forensic genetics: a review,” Methods in Molecular Biology, vol. 297, pp. 107–126, 2005.
[114] S. Giancola, H. I. McKhann, A. Bérard et al., “Utilization of the three high-throughput SNP genotyping methods, the GOOD assay, Amplifluor and TaqMan, in diploid and polyploid plants,” Theoretical and Applied Genetics, vol. 112, no. 6, pp. 1115–1124, 2006.
[115] S. Kim and A. Misra, “SNP genotyping: technologies and biomedical applications,” Annual Review of Biomedical Engineering, vol. 9, pp. 289–320, 2007.
[116] P. K. Gupta, S. Rustgi, and R. R. Mir, “Array-based high-throughput DNA markers for crop improvement,” Heredity, vol. 101, no. 1, pp. 5–18, 2008.
[117] J. Ragoussis, “Genotyping technologies for genetic research,” Annual Review of Genomics and Human Genetics, vol. 10, pp. 117–133, 2009.
[118] J. W. Davey, P. A. Hohenlohe, P. D. Etter, J. Q. Boone, J. M. Catchen, and M. L. Blaxter, “Genome-wide genetic marker discovery and genotyping using next-generation sequencing,” Nature Reviews Genetics, vol. 12, no. 7, pp. 499–510.
[119] R. J. Elshire, J. C. Glaubitz, Q. Sun et al., “A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species,” PLoS ONE, vol. 6, no. 5, Article ID e19379, 2011.
[120] J. A. Poland, P. J. Brown, M. E. Sorrells, and J.-L. Jannink, “Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach,” PLoS ONE, vol. 7, no. 2, Article ID e32253, 2012.
[121] M. W. Horton, A. M. Hancock, Y. S. Huang et al., “Genome-wide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel,” Nature Genetics, vol. 44, no. 2, pp. 212–216, 2012.
[122] G. K. Subbaiyan, D. L. E. Waters, S. K. Katiyar, A. R. Sadananda, S. Vaddadi, and R. J. Henry, “Genome-wide DNA polymorphisms in elite indica rice inbreds discovered by whole-genome sequencing,” Plant Biotechnology Journal, vol. 10, no. 6, pp. 623–634, 2012.
[123] J. C. Nelson, “Methods and software for genetic mapping,” in The Handbook of Plant Genome Mapping, pp. 53–74, Wiley-VCH, Weinheim, Germany, 2005.
[124] A. Rafalski, “Applications of single nucleotide polymorphisms in crop genetics,” Current Opinion in Plant Biology, vol. 5, no. 2, pp. 94–100, 2002.
[125] L. Kruglyak, “The use of a genetic map of biallelic markers in linkage studies,” Nature Genetics, vol. 17, no. 1, pp. 21–24.
[126] W. Xie, Q. Feng, H. Yu et al., “Parent-independent genotyping for constructing an ultrahigh-density linkage map based on population sequencing,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 23, pp. 10578–10583, 2010.
[127] F. Li, H. Kitashiba, K. Inaba, and T. Nishio, “A Brassica rapa linkage map of EST-based SNP markers for identification of candidate genes controlling flowering time and leaf morphological traits,” DNA Research, vol. 16, no. 6, pp. 311–323, 2009.
[128] E. S. Buckler, J. B. Holland, P. J. Bradbury et al., “The genetic architecture of maize flowering time,” Science, vol. 325, no. 5941, pp. 714–718, 2009.
[129] S. A. Flint-Garcia, J. M. Thornsberry, and S. B. Edward, “Structure of linkage disequilibrium in plants,” Annual Review of Plant Biology, vol. 54, pp. 357–374, 2003.
[130] P. K. Gupta, S. Rustgi, and P. L. Kulwal, “Linkage disequilibrium and association studies in higher plants: present status and future prospects,” Plant Molecular Biology, vol. 57, no. 4, pp. 461–485, 2005.
[131] M. J. Aranzana, S. Kim, K. Zhao et al., “Genome-wide association mapping in Arabidopsis identifies previously known flowering time and pathogen resistance genes,” PLoS Genetics, vol. 1, no. 5, p. e60, 2005.
[132] X. Huang, X. Wei, T. Sang et al., “Genome-wide association studies of 14 agronomic traits in rice landraces,” Nature Genetics, vol. 42, no. 11, pp. 961–967, 2010.
[133] K. L. Kump, P. J. Bradbury, R. J. Wisser et al., “Genome-wide association study of quantitative resistance to southern leaf blight in the maize nested association mapping population,” Nature Genetics, vol. 43, no. 2, pp. 163–168, 2011.
[134] J. A. Poland, P. J. Bradbury, E. S. Buckler, and R. J. Nelson, “Genome-wide nested association mapping of quantitative resistance to northern leaf blight in maize,” Proceedings of the National Academy of Sciences of the United States of America, vol. 108, no. 17, pp. 6893–6898, 2011.
[135] F. Tian, P. J. Bradbury, P. J. Brown et al., “Genome-wide association study of leaf architecture in the maize nested association mapping population,” Nature Genetics, vol. 43, no. 2, pp. 159–162, 2011.
[136] R. K. Pasam, R. Sharma, M. Malosetti et al., “Genome-wide association studies for agronomical traits in a world wide spring barley collection,” BMC Plant Biology, vol. 12, article 16, 2012.
[137] B. J. Soto-Cerda and S. Cloutier, “Association mapping in plant genomes,” in Genetic Diversity in Plants, M. Çalişkan, Ed., pp. 29–54, InTech, 2012.
[138] P. A. Morin, G. Luikart, and R. K. Wayne, “SNPs in ecology, evolution and conservation,” Trends in Ecology and Evolution, vol. 19, no. 4, pp. 208–216, 2004.
[139] P. W. Hedrick, “Perspective: highly variable loci and their interpretation in evolution and conservation,” Evolution, vol. 53, no. 2, pp. 313–318, 1999.
[140] A. Vignal, D. Milan, M. SanCristobal, and A. Eggen, “A review on SNP and other types of molecular markers and their use in animal genetics,” Genetics Selection Evolution, vol. 34, no. 3, pp. 275–305, 2002.
[141] S. Konishi, T. Izawa, S. Y. Lin et al., “An SNP caused loss of seed shattering during rice domestication,” Science, vol. 312, no. 5778, pp. 1392–1396, 2006.
[142] O. Wei, Z. Peng, Y. Zhou, Z. Yang, K. Wu, and Z. Ouyang, “Nucleotide diversity and molecular evolution of the WAG-2 gene in common wheat (Triticum aestivum L.) and its relatives,” Genetics and Molecular Biology, vol. 34, no. 4, pp. 606–615, 2011.
[143] J. D. Retief, “Phylogenetic analysis using PHYLIP,” Methods in Molecular Biology, vol. 132, pp. 243–258, 2000.
[144] K. Tamura, J. Dudley, M. Nei, and S. Kumar, “MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0,” Molecular Biology and Evolution, vol. 24, no. 8, pp. 1596–1599, 2007.
[145] M. W. Ganal, T. Altmann, and M. S. Röder, “SNP identification in crop plants,” Current Opinion in Plant Biology, vol. 12, no. 2, pp. 211–217, 2009.
[146] T. Koepke, S. Schaeffer, V. Krishnan et al., “Rapid gene-based SNP and haplotype marker development in non-model eukaryotes using 3’UTR sequencing,” BMC Genomics, vol. 13, no. 1, article 18, 2012.

SNP Discovery through Next-Generation Sequencing and Its Applications

Publisher
Hindawi Publishing Corporation
Copyright
Copyright © 2012 Santosh Kumar et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ISSN
1687-5370
DOI
10.1155/2012/831460

Abstract

Hindawi Publishing Corporation
International Journal of Plant Genomics
Volume 2012, Article ID 831460, 15 pages
doi:10.1155/2012/831460

Review Article

SNP Discovery through Next-Generation Sequencing and Its Applications

Santosh Kumar (1), Travis W. Banks (2), and Sylvie Cloutier (1, 3)

(1) Department of Plant Science, University of Manitoba, Winnipeg, MB, Canada R3T 2N2
(2) Department of Applied Genomics, Vineland Research and Innovation Centre, Vineland Station, ON, Canada L0R 2E0
(3) Cereal Research Centre, Agriculture and Agri-Food Canada, Winnipeg, MB, Canada R3T 2M9

Correspondence should be addressed to Sylvie Cloutier, sylvie.j.cloutier@agr.gc.ca

Received 3 August 2012; Accepted 8 October 2012

Academic Editor: Roberto Tuberosa

Copyright © 2012 Santosh Kumar et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The decreasing cost along with rapid progress in next-generation sequencing and related bioinformatics computing resources has facilitated large-scale discovery of SNPs in various model and nonmodel plant species. Large numbers and genome-wide availability of SNPs make them the marker of choice in partially or completely sequenced genomes. Although excellent reviews have been published on next-generation sequencing, its associated bioinformatics challenges, and the applications of SNPs in genetic studies, a comprehensive review connecting these three intertwined research areas is needed. This paper touches upon various aspects of SNP discovery, highlighting key points in availability and selection of appropriate sequencing platforms, bioinformatics pipelines, SNP filtering criteria, and applications of SNPs in genetic analyses. The use of next-generation sequencing methodologies in many non-model crops leading to discovery and implementation of SNPs in various genetic studies is discussed.
Development and improvement of bioinformatics software that is open source and freely available have accelerated SNP discovery while reducing the associated cost. Key considerations for SNP filtering and associated pipelines are discussed in specific topics. A list of commonly used software and their sources is compiled for easy access and reference.

1. Introduction

Molecular markers are widely used in plant genetic research and breeding. Single Nucleotide Polymorphisms (SNPs) are currently the marker of choice due to their large numbers in virtually all populations of individuals. The applications of SNP markers have clearly been demonstrated in human genomics where complete sequencing of the human genome led to the discovery of several million SNPs [1] and technologies to analyze large sets of SNPs (up to 1 million) have been developed. SNPs have been applied in areas as diverse as human forensics [2] and diagnostics [3], aquaculture [4], marker-assisted breeding of dairy cattle [5], crop improvement [6], conservation [7], and resource management in fisheries [8]. Functional genomic studies have capitalized upon SNPs located within regulatory genes, transcripts, and Expressed Sequence Tags (ESTs) [9, 10]. Until recently, large-scale SNP discovery in plants was limited to maize, Arabidopsis, and rice [11–15]. Genetic applications such as linkage mapping, population structure, association studies, map-based cloning, marker-assisted plant breeding, and functional genomics continue to be enabled by access to large collections of SNPs. Arabidopsis thaliana was the first plant genome sequenced [16], followed soon after by rice [17, 18]. In the year 2011 alone, the number of plant genomes sequenced doubled as compared to the number sequenced in the previous decade, resulting in 31 (and counting) publicly released sequenced plant genomes to date (http://www.phytozome.net/). With the ever increasing throughput of next-generation sequencing (NGS), de novo and reference-based SNP discovery and application are now feasible for numerous plant species.

Sequencing refers to the identification of the nucleotides in a polymer of nucleic acids, whether DNA or RNA. Since its inception in 1977, sequencing has brought about the field of genomics and increased our understanding of the organization and composition of plant genomes. Tremendous improvements in sequencing have led to the generation of large amounts of DNA information in a very short period of time [19]. The analyses of large volumes of data generated through various NGS platforms require powerful computers and complex algorithms and have led to a recent expansion of the bioinformatics field of research. This book chapter focuses on the a priori discovery of SNPs through NGS, bioinformatics tools and resources, and the various downstream applications of SNPs.

2. History and Evolution of Sequencing Technologies

2.1. Invention of Sequencing. In 1977, two sequencing methods were developed and published. The Sanger method is a sequencing-by-synthesis (SBS) method that relies on a combination of deoxy- and dideoxy-labeled chain terminator nucleotides [20]. The first complete genome sequencing, that of bacteriophage phi X174, was achieved that same year using this pioneering method [21]. The method of chemical modification followed by cleavage at specific sites, also published in 1977 [22], quickly became the less favored of the two methods because of its technical complexities, use of hazardous chemicals, and inherent difficulty in scale-up. In contrast, the Sanger method, for which Frederick Sanger was awarded his second Nobel Prize in chemistry in 1980, was quickly adopted by the biotechnology industry which implemented it using a broad array of chemistries and detection methods [19].

2.2. Sequencing Technologies. In the last decade, new sequencing technologies have outperformed Sanger-based sequencing in throughput and overall cost, if not quite in sequence length and error rate [23]. This section will focus on the three main NGS platforms as well as the two main third-generation sequencing (TGS) platforms, their throughput and relative cost. We made every effort to ensure the accuracy of the data at the time of submission. However, the cost and throughput of these sequencing platforms change rapidly and, as such, our analysis only represents a snapshot in time. The flux of innovation in this field imposes a need for constant assessment of the technologies’ potentials and realignment of research goals.

2.2.1. Roche (454) Sequencing. Pyrosequencing was the first of the new highly parallel sequencing technologies to reach the market [24]. It is commonly referred to as 454 sequencing after the name of the company that first commercialized it. It is an SBS method where single fragments of DNA are hybridized to a capture bead array and the beads are emulsified with the reagents necessary to PCR-amplify the individually bound template. Each bead in the emulsion acts as an independent PCR where millions of copies of the original template are produced and bound to the capture beads which then serve as the templates for the subsequent sequencing reaction. The individual beads are deposited into a picotiter plate along with DNA polymerase, primers, and the enzymes necessary to create fluorescence through the consumption of inorganic phosphate produced during sequencing. The instrument washes the picotiter plate with each of the DNA bases in turn. As template-specific incorporation of a base by DNA polymerase occurs, a pyrophosphate (PPi) is produced. This pyrophosphate is detected by an enzymatic luminometric inorganic pyrophosphate detection assay (ELIDA) through the generation of a light signal following the conversion of PPi into ATP [25]. Thus, the wells in which the current nucleotides are being incorporated by the sequencing reaction occurring on the bead emit a light signal proportional to the number of nucleotides incorporated, whereas wells in which the nucleotides are not being incorporated do not. The instrument repeats the sequential nucleotide wash cycle hundreds of times to lengthen the sequences. The 454 GS FLX Titanium XL platform currently generates up to 700 MB of raw 750 bp reads in a 23 hour run. The technology has difficulty quantifying homopolymers, resulting in insertions/deletions, and has an overall error rate of approximately 1%. Reagent costs are approximately $6,200 per run [26].

2.2.2. Illumina Sequencing. Illumina technology, acquired by Illumina from Solexa, followed the release of 454 sequencing. With this sequencing approach, fragments of DNA are hybridized to a solid substrate called a flow cell. In a process called bridge amplification, the bound DNA template fragments are amplified in an isothermal reaction where copies of the template are created in close proximity to the original. This results in clusters of DNA fragments on the flow cell creating a “lawn” of bound single strand DNA molecules. The molecules are sequenced by flooding the flow cell with a new class of cleavable fluorescent nucleotides and the reagents necessary for DNA polymerization [27]. A complementary strand of each template is synthesized one base at a time using fluorescently labeled nucleotides. The fluorescent molecule is excited by a laser and emits light, the colour of which is different for each of the four bases. The fluorescent label is then cleaved off and a new round of polymerization occurs. Unlike 454 sequencing, all four bases are present for the polymerization step and only a single molecule is incorporated per cycle. The flagship HiSeq2500 sequencing instrument from Illumina can generate up to 600 GB per run with a read length of 100 nt and 0.1% error rate. The Illumina technique can generate sequence from opposite ends of a DNA fragment, so-called paired-end (PE) reads. Reagent costs are approximately $23,500 per run [26].

2.2.3. Applied Biosystems (SOLiD) Sequencing. The SOLiD system was jointly developed by the Harvard Medical School and the Howard Hughes Medical Institute [28]. The library preparation in SOLiD is very similar to Roche/454 in which clonal bead populations are prepared in microreactors containing DNA template, beads, primers, and PCR components. Beads that contain PCR products amplified by emulsion PCR are enriched by a proprietary process. The DNA templates on the beads are modified at their 3′ end to allow attachment to glass slides. A primer is annealed to an adapter on the DNA template and a mixture of fluorescently tagged oligonucleotides is pumped into the flow cell. When the oligonucleotide matches the template sequence, it is ligated onto the primer and the unincorporated nucleotides are washed away. A charge-coupled device (CCD) camera captures the different colours attached to the primer. Each fluorescence wavelength corresponds to a particular dinucleotide combination. After image capture, the fluorescent tag is removed and a new set of oligonucleotides is injected into the flow cell to begin the next round of DNA ligation [19]. This sequencing-by-ligation method in the SOLiD 5500xl platform generates up to 1,410 million PE reads of 75 + 35 nt each with an error rate of 0.01% and reagent cost of approximately $10,500 per run [26].

Although widely accepted and used, the NGS platforms suffer from amplification biases introduced by PCR and dephasing due to varying extension of templates. The TGS technologies use single molecule sequencing, which eliminates the need for prior amplification of DNA, thus overcoming the limitations imposed by NGS. The advantages offered by TGS technology are (i) lower cost, (ii) high throughput, (iii) faster turnaround, and (iv) longer reads [19, 29]. The TGS can broadly be classified into three different categories: (i) SBS where individual nucleotides are observed as they incorporate (Pacific Biosciences single molecule real time (SMRT), Heliscope true single molecule sequencing (tSMS), and Life Technologies/Starlight and Ion Torrent), (ii) nanopore sequencing where single nucleotides are detected as they pass through a nanopore (Oxford/Nanopore), and (iii) direct imaging of individual molecules (IBM).

2.2.4. Helicos Biosciences Corporation (Heliscope) Sequencing. Heliscope sequencing involves DNA library preparation and DNA shearing followed by addition of a poly-A tail to the sheared DNA fragments. These poly-A tailed DNA fragments are attached to flow cells through poly-T anchors. The sequencing proceeds by DNA extension with one out of 4 fluorescent tagged nucleotides incorporated followed by detection by the Heliscope sequencer. The fluorescent tag on the incorporated nucleotide is then chemically cleaved to allow subsequent elongation of DNA [30]. Heliscope sequencers can generate up to 28 GB of sequence data per run (50 channels) with maximum read length of 55 bp at ∼99% accuracy [31]. The cost per run per channel is approximately $360.

2.2.5. Pacific Biosciences SMRT Sequencing. The Pacific Biosciences sequencer uses glass anchored DNA polymerases which are housed at the bottom of a zero-mode waveguide (ZMW). DNA fragments are added into the ZMW chamber with the anchored DNA polymerase and nucleotides, each labeled with a different colour fluorophore, and are diffused from above the ZMW. As the nucleotides circulate through the ZMW, only the incorporated nucleotides remain at the bottom of the ZMW while unincorporated nucleotides diffuse back above the ZMW. A laser placed below the ZMW excites only the fluorophores of the incorporated nucleotides as the ZMW entraps the light and does not allow it to reach the unincorporated nucleotides above [32]. The Pacific Biosciences sequencers can generate up to 140 MB of sequences per run (per SMRT cell) with reads of 2.5 Kbp at ∼85% accuracy. The cost per run per SMRT cell is approximately $600.

Among the TGS technologies, Pacific Biosciences SMRT and Heliscope tSMS have been used in characterizing bacterial genomes and in human-disease-related studies [31]; however, TGS has yet to be capitalized upon in plant genomes. The Heliscope generates short reads (55 bp) which may cause ambiguous read mapping due to the presence of paralogous sequences and repetitive elements in plant genomes. The Pacific Biosciences reads have high error rates which limit their direct use in SNP discovery. However, their long reads offer a definite advantage to fill gaps in genomic sequences and, at least in bacterial genomes, NGS reads have proven capable of “correcting” the base call errors of this TGS technology [33–36]. Hybrid assemblies incorporating short (Illumina, SOLiD), medium (454/Roche), and long reads (Pac-Bio) have the potential to yield better quality reference genomes and, as such, would provide an improved tool for SNP discovery.

The choice of a sequencing strategy must take into account the research goals, ability to store and analyze data, the ongoing changes in performance parameters, and the cost of NGS/TGS platforms. Some key considerations include cost per raw base, cost per consensus base, raw and consensus accuracy of bases, read length, cost per read, and availability of PE or single end reads. The pre- and postprocessing protocols such as library construction [37] and pipeline development and implementation for data analysis [38] are also important.

2.3. RNA and ChIP Sequencing. Genome-wide analyses of RNA sequences and their qualitative and quantitative measurements provide insights into the complex nature of regulatory networks. RNA sequencing has been performed on a number of plant species including Arabidopsis [39], soybean [40], rice [41], and maize [42] for transcript profiling and detection of splice variants. RNA sequencing has been used in de novo assemblies followed by SNP discovery performed in nonmodel plants such as Eucalyptus grandis [43], Brassica napus [44], and Medicago sativa [45]. RNA deep-sequencing technologies such as digital gene expression [46] and Illumina RNASeq [47] are both qualitative and quantitative in nature and permit the identification of rare transcripts and splice variants [48]. RNA sequencing may be performed following its conversion into cDNA that can then be sequenced as such. This method is, however, prone to error due to (i) the inefficient nature of reverse transcriptases (RTs) [49], (ii) DNA-dependent DNA polymerase activity of RT causing spurious second strand DNA [50], and (iii) artifactual cDNA synthesis due to template switching [51]. Direct RNA sequencing (DRS) developed by Helicos Biosciences Corporation is a high throughput and cost-effective method which eliminates the need for cDNA synthesis and ligation/amplification, leading to improved accuracy [52].

Chromatin immunoprecipitation (ChIP) is a specialized sequencing method that was specifically designed to identify DNA sequences involved in in vivo protein DNA interaction [53]. ChIP-sequencing (ChIP-Seq) is used to map the binding sites of transcription factors and other DNA binding sites for proteins such as histones. As such, ChIP-Seq does not aid SNP discovery, but the availability of SNP data along with ChIP-Seq allows the study of allele-specific states of chromatin organization. Deep sequence coverage leading to dense SNP maps permits the identification of transcription factor binding sites and histone-mediated epigenetic modifications [54]. ChIP-Seq can be performed on serial analysis of gene expression (SAGE) tags or PE using Sanger, 454, and Illumina platforms [55, 56].

The DNA, RNA, and ChIP-Seq data are analysed using a reference sequence if available or, in the absence of such a reference, require de novo assembly, all of which is performed using specialized software, algorithms, pipelines, and hardware.

most biologists are unfamiliar with Linux operating systems, its structure and command lines, thereby imposing a steep learning curve for adoption. Linux-based software such as Bowtie [59], BWA [60], and SOAP2/3 [61] have been used widely for the analysis of NGS data. Other software may not have gained broad acceptance but may have unique features worth noting. For reviews on NGS software, see Li and Homer [62], Wang et al. [58], and Treangen and Salzberg [63]. Characteristics of the most common NGS software and their attributes are listed in Table 1, and their download information can be found in Table 4.

3.2. Consideration for Software Selection. In selecting software for NGS data analysis one must consider, among other things, the sequencing platform, the availability of a reference genome, the computing and storage resources necessary, and the bioinformatics expertise available. Algorithms used for sequence analysis have matured significantly but may still require computing power beyond what is currently available in most genomics facilities and/or long processing time. For 3. 
Computing Resources for Sequence Assembly example, in aligning 2 × 13,326,195 paired-end reads (76 bp) from The Cancer Genome Atlas project (SRR018643) [64], The next-generation platforms generate a considerable SHRiMP [65] took 1,065 hrs with a peak memory footprint amount of data and the impact of this with respect to of 12 gigabytes to achieve the mapping of 81% of the reads data storage and processing time can be overlooked when to the human genome reference whereas Bowtie used 2.9 designing an experiment. Bioinformatics research is con- gigabytes of memory, a run time of 2.2 hrs but only achieved stantly developing new software and algorithms, data storage a 67% mapping rate [58]. Both time and memory become approaches, and even new computer architectures to better critical when dealing with a very large NGS dataset. Fast and meet the computation requirements for projects incorpo- memory efficient sequence mapping seems to be preferred rating NGS. This chapter describes the state-of-the-art with over slower, memory demanding software even at the cost respect to software for NGS alignment and analysis at the of a reduced mapping rate. It should be noted that a higher time of writing. percentage of mapped reads is not a strict measure of quality because it may be indicative of a higher level of misaligned 3.1. Software for Sequence Analysis. Both commercial and reads or reads aligned against repetitive elements, features noncommercial sequence analysis software are available for that are not desirable [63]. Windows, Macintosh, and Linux operating systems. NGS In the absence of a reference genome, de novo assembly companies offer proprietary software such as consensus of a plant genome is achieved using sequence information assessment of sequence and variation (Cassava) for Illu- obtained through a combination of Sanger and/or NGS mina data and Newbler for 454 data. 
Such software tend of bacterial artificial chromosome (BAC) clones, or by to be optimized for their respective platform but have whole genome shotgun (WGS) with NGS [66]. De novo limited cross applicability to the others. Web-based por- assemblies are time consuming and require much greater tals such as Galaxy [57] are tailored to a multitude of computing power than read mapping onto a reference analyses, but the requirement to transfer multigigabyte genome. The assembly accuracy depends in part on the read sequence files across the internet can limit its usabil- length and depth as well as the nature of the sequenced ity to smaller datasets. Commercially available software genome. The genomes of Arabidopsis thaliana [16], rice such as CLC-Bio (http://www.clcbio.com/) and SeqMan [67], and maize [68] were generated using a BAC-by-BAC NGen (http://www.dnastar.com/t-sub-products-genomics- approach while poplar [69], grape [70], and sorghum seqman-ngen.aspx) provide a friendly user interface, are [71] genomic sequences were obtained through WGS. All compatible with different operating systems, require mini- genomes sequenced to date are fragmented to varying mal computing knowledge, and are capable of performing degrees because of the inability of sequencing technologies multiple downstream analyses. However, they tend to be rel- and bioinformatics algorithms to assemble through highly atively expensive, have narrow customizability, and require conserved repetitive elements. A list of current plant genome locally available high computing power. A recent review by sequencing projects, their sequencing strategies, and status Wang et al. [58] recommends Linux-based programs because from standard draft to finished can be found in the review by they are often free, not specific to any sequencing platform, Feuillet et al. [72]. and less computing power hungry and, as a consequence, Software programs such as Mira [73], SOAPdenovo [74], tend to perform faster. 
Flexibility in the parameter’s choice ABySS [75], and Velvet [76]havebeenusedfor de novo for read assembly is another major advantage. However, assembly. MIRA is well documented and can be readily International Journal of Plant Genomics 5 Table 1: List of most cited/used software for sequence assembly of NGS data. Source locations for these software are compiled in Table 4. Assembly type Supported parameters Name (current version) Output format Platform (algorithm) Color space Read length Gapped alignment Paired-end Reference CLC-Bio CLC-Bio Yes Arbitrary Yes Yes Linux/Windows/Mac OS X Reference ACE, BAM SeqMan NGen Yes Arbitrary Yes Yes Windows/Mac OS X Next Reference NextGENe Yes Arbitrary Yes Yes Windows/Mac OS X GENe Bowtie (2) Reference (FM-index) Yes Arbitrary Yes Yes SAM Linux/Windows/Mac OS X BWA Reference (FM-index) Yes Arbitrary Yes Yes SAM Linux SOAP (3) Reference (FM-index) Yes Arbitrary No Yes SOAP2/3 Linux MAQ (0.6.6) Reference (Hashing reads) Yes ≤127 Yes Yes MAQ Linux/Solaris/Mac OS X Reference Novoalign (2.07.07) Yes Arbitrary Yes Yes SAM Linux/Mac OS X (Hashing reference) Reference Mosaik (1.1.0018) Yes Arbitrary Yes Yes SAM Linux/Windows/Mac OS X/Solaris (Hashing reference) Reference SHRiMP (2.2.2) Yes Arbitrary Yes Yes SAM Linux/Mac OS X (Hashing reference) Reference FASTA, ACE Mira (3.4) Yes Arbitrary Yes Yes Linux 1 2 Commercial software. Option for de novo assembly and modules included for variant calling. 6 International Journal of Plant Genomics customized, but it requires substantial computing memory and is not suited for large complex genomes. Of the freely available software, SOAPdenovo is one of the fastest read assembly programs and it uses a comparatively moderate amount of computing memory. The assembly generated by SOAPdenovo can be used for SNP discovery using SOAPsnp as implemented for the apple genome [77]. ABySS can be deployed on a computer cluster. It requires the least amount of memory and can be used for large genomes. 
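As a worked illustration of the speed versus sensitivity trade-off described in Section 3.2, the following minimal Python sketch turns the SHRiMP and Bowtie figures quoted there [58] into mapped reads per hour. The function and variable names are hypothetical, and the calculation assumes that the quoted mapping rates apply to all 2 × 13,326,195 reads; it is arithmetic on the published numbers, not a benchmark.

```python
# Total reads in the quoted example: both mates of each pair are counted.
TOTAL_READS = 2 * 13_326_195

# Figures as quoted in the text: (mapping rate, run time in hours, peak memory in GB).
ALIGNERS = {
    "SHRiMP": (0.81, 1065.0, 12.0),
    "Bowtie": (0.67, 2.2, 2.9),
}


def mapped_reads_per_hour(mapping_rate, run_time_hours, total_reads=TOTAL_READS):
    """Return the number of reads mapped per hour of run time."""
    return mapping_rate * total_reads / run_time_hours


def summarize(aligners=ALIGNERS):
    """Build a per-aligner summary of mapped reads, throughput, and memory."""
    summary = {}
    for name, (rate, hours, memory_gb) in aligners.items():
        summary[name] = {
            "mapped_reads": rate * TOTAL_READS,
            "mapped_per_hour": mapped_reads_per_hour(rate, hours),
            "peak_memory_gb": memory_gb,
        }
    return summary


for name, row in summarize().items():
    print(
        f"{name}: {row['mapped_reads']:,.0f} reads mapped, "
        f"{row['mapped_per_hour']:,.0f} reads/hour, "
        f"{row['peak_memory_gb']} GB peak memory"
    )
```

Under these assumptions SHRiMP maps more reads in total, while Bowtie maps several hundred times more reads per hour with about a quarter of the memory, which is the trade-off the text describes.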
Velvet requires the largest amount of memory. It can use mate-pair information to resolve and correct assembly errors.

4. SNP Discovery

The most common application of NGS is SNP discovery, whose downstream usefulness in linkage map construction, genetic diversity analyses, association mapping, and marker-assisted selection has been demonstrated in several species [78]. NGS-derived SNPs have been reported in humans [79], Drosophila [80], wheat [81, 82], eggplant [83], rice [84–86], Arabidopsis [87, 88], barley [14, 89], sorghum [90], cotton [91], common beans [78], soybean [92], potato [93], flax [94], Aegilops tauschii [95], alfalfa [96], oat [97], and maize [98] to name a few.

SNP discovery using NGS is readily accomplished in small plant genomes for which good reference genomes are available such as rice and Arabidopsis [86, 99]. Although SNP discovery in complex genomes without a reference genome such as wheat [81, 82], barley [14, 89], oat [97], and beans [78] can be achieved through NGS, several challenges remain in other nonmodel but economically important crops. The presence of repeat elements, paralogs, and incomplete or inaccurate reference genome sequences can create ambiguities in SNP calling [63]. NGS read mapping can also suffer from sequencing error (erroneous base calling) and misaligned reads. The following section focuses on programs tailored for SNP discovery and emphasizes some of the precautions and considerations to minimize erroneous SNP calling.

4.1. Software and Pipelines for SNP Discovery. In theory, an SNP is identified when a nucleotide from an accession read differs from the reference genome at the same nucleotide position. In the absence of a reference genome, this is achieved by comparing reads from different genotypes using de novo assembly strategies [95]. Read assembly files generated by mapping programs are used to perform SNP calling. In practice, various empirical and statistical criteria are used to call SNPs, for example a minimum and maximum number of reads considering the read depth, the quality score, and the consensus base ratio [95]. Thresholds for these criteria are adjusted based on the read length and the genome coverage achieved by the NGS data. In assemblies generated allowing single nucleotide variants and insertions/deletions (indels), a list of SNP and indel coordinates is generated and the read mapping results can be visualized using graphical user interface programs such as Tablet [100] (Figure 1), SNP-VISTA [101], or Savant [102] (refer to Table 4 for download information). Tablet has a user-friendly interface and is widely used because it supports a wide array of commonly used file formats such as SAM, BAM, SOAP, ACE, FASTQ, and FASTA generated by different read assemblers such as Bowtie, BWA, SOAP, MAQ, and SeqMan NGen. It displays contig overview, coverage information, and read names, and it allows searching for specific coordinates on scaffolds.

Figure 1: Graphical user interface of Tablet, an assembly visualization program, displays the reference genome on top and the mapped reads with color-coded SNPs on the bottom.

Broadly used SNP calling software include Samtools [103], SNVer [104], and SOAPsnp [74]. Samtools is popular because of its various modules for file conversion (SAM to BAM and vice-versa), mapping statistics, variant calling, and assembly visualization. Recently, SOAPsnp has gained popularity because of its tight integration with SOAP aligner and other SOAP modules which are constantly upgraded and provide a one stop shop for the sequencing analysis continuum. Variant calling algorithms such as Samtools and SNVer can be used as stand-alone programs or incorporated into pipelines for SNP calling. Reviews of SNP calling software have been published [63, 105]. Some of the main features of the current commonly used software are listed in Table 2 (refer to Table 4 for download information).

4.2. SNP Discovery from Multiple Individuals and Complex Genomes. SNP discovery is more robust when multiple and divergent genotypes are used simultaneously, creating the necessary basis to capture the genetic variability of a species. Large parts of plant genomes consist of repetitive elements [106] which can cause spurious SNP calling by erroneous read mapping to paralogous repeat element sequences. In polyploid genomes such as cotton (allotetraploid), homoeologous sequences can cause similar misalignment [91]. Improved read assembly and filtering of SNPs become even more important factors for accurate SNP calling in these cases because they can mitigate the effects of errors caused by paralogs and homoeologs.

Read assembly algorithms such as Bowtie and SOAP as well as variant calling/genotyping software such as GATK [107] are rapidly evolving to accommodate an ever increasing number of reads, increased read length, nucleotide quality values, and mate-pair information of PE reads. Assembly programs such as Novoalign (http://www.novocraft.com/main/index.php) and STAMPY [108], although

Table 2: Commonly used NGS variant calling software. Download information for these software is compiled in Table 4. A more comprehensive list of variant calling programs is available at http://seqanswers.com/wiki/Software/list.
Software | Multisample support | Reference | Features | Platform
Samtools | Yes | Aligned reads | Include computation of genotype likelihoods and variant calling | Linux
SOAPsnp | No | Variant database | Part of SOAP3 for variant calling | Linux
GATK | Yes | Aligned reads | Include variant caller, SNP filter, and SNP quality calibrator | Linux
SNVer | Yes | Aligned reads | Fast variant caller, assigning SNP significance based on read depth | Windows, Linux, Mac OS X
SHORE | Yes | Aligned reads | Variant calling based on reference sequence even from other species | Linux, Mac OS X
MaCH | Yes | Genotype likelihoods | Variant calling with or without LD information | Windows, Linux, Mac OS X
IMPUTE2 | Yes | Candidate SNPs and genotype likelihoods | Variant calling and linkage map-based SNP imputation | Windows, Linux, Mac OS X

memory and time intensive, are highly sensitive for simultaneous mapping of short reads from multiple individuals [105].

SNP calls can be significantly improved using filtering criteria that are specific to the genome characteristics and the dataset. For instance, projects aimed at resequencing can compare different datasets from the same genotype and thus eliminate data with large discrepancies. This strategy identifies the most common sources of error and is applied in the 1000 Genomes Project [109]. Reduced representation libraries (RRLs), that is, sequencing an enriched subset of a genome by eliminating a proportion of its repetitive fractions [79], reduce the probability of misalignments to repeats and thus potential downstream erroneous SNP calling. Filtering criteria that can improve SNP accuracy include (i) a minimum read depth (often ≥3 per genotype), (ii) >90% nucleotides within a genotype having identical call at a given position (∼<10% sequencing error), (iii) a read depth ≤ mean of the sequence depth over the entire mapping assembly, (iv) the elimination of ribosomal DNA and other repetitive elements in the 50 nt flanking any SNP call, and (v) masking of homopolymer SNPs with a given base string length (often ≥2). Additionally, in polyploid species, separate assembly of homoeologs using stringent mapping parameters is often essential for genome-wide SNP identification to avoid spurious SNP calls caused by erroneous homoeologous read mapping [91].

4.3. SNP Validation. Prior to any SNP applications, the discovered SNPs must be validated to identify the true SNPs and get an idea of the percentage of potentially false SNPs resulting from an SNP discovery exercise. The need for validation arises because a proportion of the discovered SNPs could have been wrongly called for various reasons including those outlined above. SNP validation can be accomplished using a variety of material such as a biparental segregating population or a diverse panel of genotypes. Usually a small subset of the SNPs is used for validation through assays such as the Illumina Goldengate [110], KBiosciences Competitive Allele-Specific PCR SNP genotyping system (KASPar) (http://www.lgcgenomics.com/) or the High Resolution Melting (HRM) curve analysis. Validation can serve as an iterative and informative process to modify and optimize the SNP filtering criteria to improve SNP calling. For example, a subset of 144 SNPs from a total of 2,113,120 SNPs were validated using the Goldengate assay on 160 accessions in apple [77]. Another example is illustrated in Figure 2 where a KASPar assay was performed on 92 genotypes from a segregating population illustrating the validation of a single “T/C” SNP in two distinct clusters. Other validation strategies used in nonmodel organisms are tabulated in Garvin et al. [111]. With the continuously competitive pricing of NGS, genotyping-by-sequencing (GBS) is becoming a viable SNP validation method. Either biparental segregating populations or a collection of diverse genotypes can be sequenced at a reasonable cost using indexing, that is, combining multiple independently tagged genotypes in a single NGS run to obtain genome-wide or reduced representation genome sequences at a lower coverage but potentially validating a much larger number of SNPs than the methods described above. Sequencing of segregating populations or diverse genotypes may also lead to the discovery of additional SNPs.

Figure 2: Validation of a T/C SNP by a KASPar assay (KBiosciences, Herts, England). Genotypes with a “T” are represented by black dots with a white cross clustered in the upper left and those with a “C” by white dots with a black cross in the bottom right cluster. The two black dots near the bottom left are negative controls. No heterozygous individuals were present in this population.

The two major factors affecting the SNP validation rate are sequencing and read mapping errors as discussed above. NGS platforms have different levels of sequencing accuracies, and this may be the most important factor determining the variation in the validation rate, from 88.2% for SOLiD followed by Illumina at 85.4% and Roche 454 at 71% [95]. The SNP validation rates can be improved using RRL for SNP discovery and choosing SNPs within the nonrepetitive sequences including predicted single copy genes and single copy repeat junctions shown to have high validation rates [95].

5. SNP Genotyping

SNP genotyping is the downstream application of SNP discovery to identify genetic variations. SNP applications include phylogenetic analysis, marker-assisted selection, genetic mapping of quantitative trait loci (QTL), bulked segregant analysis, genomic selection, and genome-wide association studies (GWAS). The number of SNPs and individuals to screen are of primary importance in choosing an SNP genotyping assay, though cost of the assay and/or equipment and the level of accuracy are also important considerations.

Illumina Goldengate is a commonly used genotyping assay because of its flexibility in interrogating 96 to 3,072 SNP loci simultaneously (http://www.illumina.com/). HRM analysis is suitable for a few to an intermediate number of SNPs and can be performed within a typical laboratory setting. KASPar and SNPline genotyping systems (http://www.lgcgenomics.com/) can be used for genotyping a few to thousands of SNPs in a laboratory setting. The SNPline system is available in SNPlite or SNPline XL versions to allow flexibility in sample number and SNP assays. The iPLEX Gold technology developed by Sequenom (http://www.sequenom.com/) is based on the MassARRAY system which uses primer extension chemistry and matrix-assisted laser desorption/ionisation-time of flight (MALDI-TOF) mass spectrometry for genotyping.

The iPLEX Gold system has gained acceptance due to its high precision and cost-effective implementation. High throughput chip-based genotyping assays such as the Affymetrix GeneChip arrays (http://www.affymetrix.com/estore/) and the Illumina BeadChips (http://www.illumina.com/) are capable of validating up to a million SNPs per reaction across an entire genome. Detailed analyses of SNP genotyping assays and their features are reviewed in Tsuchihashi and Dracopoli [112], Sobrino and Carracedo [113], Giancola et al. [114], Kim and Misra [115], Gupta et al. [116], and Ragoussis [117]. A list of the most commonly used genotyping assays describing the assay type, technology, throughput, multiplexing ability, and relative scalability can be found in Table 3.

Table 3: Commonly used genotyping platforms.

Name | Assay type | Technology | Throughput (samples) | Multiplexing | Relative scale (no. of SNP/no. of individuals)
Genechip | Hybridization | Oligo nucleotide array | 96/5 days | Up to 18 × 10 | Small/large
Infinium II | Hybridization | Bead array | Up to 128/5 days | Up to 13 × 10 | Large/small-large
Goldengate | Primer extension-ligation | Bead array | 172/3 days | Up to 3,072 | Medium/large
iPlex | Primer extension | Mass spectrometry (MALDI-TOF) | 3840/2.5 days | Up to 40 | Medium/large
Taqman | PCR | Taqman probe | Up to 1536/day | Up to 256 | Medium/medium
SNPlex | PCR | Capillary electrophoresis | Up to 1536/3 days | Up to 48 | Medium/large
KASPar | PCR | FRET quenching oligos | Up to 96/day | - | Medium/large
Invader | Primer annealing/endonuclease digestion | FRET quenching oligos | Up to 384/day | Up to 200,000 | Medium/large
HRM | PCR | Melting curve analysis | Up to 1536/day | - | Medium/large

Array-based technologies such as Infinium and Goldengate substantially improved SNP genotyping efficiency, but they are species-specific, expensive to design, and require specific equipment and chemistry. PCR and primer extension technologies like KASPar and Taqman (http://www.lifetechnologies.com/global/en/home.html) are limited by their low SNP throughput but can be useful to assay a large number of genotypes with few SNPs. NGS technologies have become viable for genotyping studies and may offer advantages over other genotyping methods in cost and efficiency.

5.1. Genotyping-by-Sequencing (GBS). There have been a number of approaches developed that use complexity reduction strategies to lower the cost and simplify the discovery of SNP markers using NGS, RNA-Seq, complexity reduction of polymorphic sequences (CRoPS), restriction-site-associated DNA sequencing (RAD-Seq), and GBS [118]. Of these methodologies GBS holds the greatest promise to serve the widest base of plant researchers because of its ability to allow simultaneous marker discovery and genotyping with low cost and a simple molecular biology workflow. Briefly, GBS involves digesting the genome of each individual in a population to be studied with a restriction enzyme [119]. One unique and one common adapter are ligated to the fragments and a PCR is carried out which is biased towards amplifying smaller DNA fragments. The resulting PCR products are then pooled and sequenced using an Illumina platform. The amplicons are not fragmented so only the ends of the PCR products are sequenced. The unique adapter acts as an ID tag so sequencing reads can be associated with an individual. The technique can be applied to species with or without a reference genome. The choice of enzyme has an effect on the number of markers identified and the amount of sequence coverage required. The more frequent the restriction recognition site, the higher the number of fragments and therefore the more potential markers. Use of more frequent cutters may necessitate greater amounts of sequencing depending on the application. Poland et al. [120] recently demonstrated the use of two restriction enzymes to perform GBS in bread wheat, a hexaploid genome.

GBS has the potential to be a truly revolutionary technology in the arena of plant genomics. It brings high density genotyping to the vast majority of plant species that, until now, have had almost no investment in genomics resources. With little capital investment requirement and an affordable per sample cost, all plant researchers now have powerful genomic and genetic methodologies available to them. Uses of GBS include applications in marker discovery, phylogenetics, bulked segregant analysis, QTL mapping in biparental lines, GWAS, and genomic selection. GBS can also be applied to fine mapping in candidate gene discovery and be used to generate high-density SNP genetic maps to assist in de novo genome assembly. We predict tremendous advances in functional genomics and plant breeding from the implementation of GBS because it is truly a democratizing application for NGS in nonmodel plant systems.

6. Applications of SNPs

NGS and SNP genotyping technologies have made SNPs the most widely used marker for genetic studies in plant species such as Arabidopsis [121] and rice [122]. SNPs can help to decipher breeding pedigree, to identify genomic divergence of species to elucidate speciation and evolution, and to associate genomic variations to phenotypic traits [85]. The ease of SNP development, reasonable genotyping costs, and the sheer number of SNPs present within a collection of individuals allow an assortment of applications that can have a tremendous impact on basic and applied research in plant species.

6.1. SNPs in Genetic Mapping. A genetic map refers to the arrangement of traits, genes, and markers relative to each other as measured by their recombination frequency. Genetic maps are essential tools in molecular breeding for plant genetic improvement as they enable gene localization, map-based cloning, and the identification of QTL [123]. SNPs have greatly facilitated the production of much higher density maps than previous marker systems. SNPs discovered using RNA-Seq and expressed sequence tags (ESTs) have the added advantage of being gene specific [124]. Their high abundance and rapidly improving genotyping technologies make SNPs an ideal marker type for generating new genetic maps as well as saturating existing maps created with other markers. Most SNPs are biallelic, thereby having a lower polymorphism information content (PIC) value as compared to most other marker types which are often multiallelic [125]. The limited information associated with their biallelic nature is greatly compensated by their high frequency, and a map of 700–900 SNPs has been found to be equivalent to a map of 300–400 simple sequence repeat (SSR) markers [125]. SNP-based linkage maps have been constructed in many economically important species such as rice [126], cotton [91], and Brassica [127]. The identification of candidate genes for flowering time in Brassica [127] and maize [128] are practical examples of gene discovery through SNP-based genetic maps.

6.2. Genome-Wide Association Mapping. Association mapping (AM) panels provide a better resolution, consider numerous alleles, and may provide faster marker-trait association than biparental populations [129]. AM, often referred to as linkage disequilibrium (LD) mapping, relies on the nonrandom association between markers and traits [130]. LD can vary greatly across a genome. In low LD regions, high marker saturation is required to detect marker-trait association, hence the need for densely saturated maps. In general, GWASs require 10,000–100,000 markers applied to a collection of genotypes representing a broad genetic basis [130].

In the past few years, NGS technologies have led to the discovery of thousands, even millions of SNPs, and novel application platforms have made it possible to produce genome-wide haplotypes of large numbers of genotypes, making SNPs the ideal marker for GWASs. So far, 951 GWASs have been reported in humans (http://www.bing.com/search?q=www.genome.gov%2Fgwastudies%2F&src=ie9tr). In plants, such a study was first reported in Arabidopsis for flowering time and pathogen-resistance genes [131]. A GWAS performed in rice using ∼3.6 million SNPs identified genomic regions associated with 14 agronomic traits [132]. The genetic structure of northern leaf blight, southern leaf blight, and leaf architecture was studied using ∼1.6 million SNPs in maize [133–135]. SNP-based GWAS was also performed on species such as barley for which a reference genome sequence is not available [136]. Soto-Cerda and Cloutier [137] have reviewed the concepts, benefits, and limitations of AM in plants.

6.3. Evolutionary Studies. SSRs and mitochondrial DNA have been used in evolutionary studies since the early 1990s [138]. However, the biological inferences from results of these two marker types may be misinterpreted due to homoplasy, a phenomenon in which similarity in traits or markers occurs due to reasons other than ancestry, such as convergent evolution, evolutionary reversal, gene duplication, and horizontal gene transfer [139]. The advantage of SNPs over microsatellites and mitochondrial DNA resides in the fact that SNPs represent single base nucleotide substitutions and, as such, they are less affected by homoplasy because their origin can be explained by mutation models [140]. SNPs have been employed to quantify genetic variation, for individual identification, and to determine parentage relatedness and population structure [138]. Seed shattering (or loss thereof) has been associated with an SNP through a GWAS aimed at unraveling the evolution of rice that led to its domestication [141]. SNPs have also been used to study the evolution of genes such as WAG-2 in wheat [142]. Algorithms such as neighbor-joining and maximum likelihood implemented in the PHYLIP [143] and MEGA [144] software are commonly used to generate phylogenetic trees.

Table 4: Download information of software used for NGS data.

Software | Source
Bowtie | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
BWA | http://bio-bwa.sourceforge.net/
SOAP | http://soap.genomics.org.cn/soap3.html#down2
MAQ | http://sourceforge.net/projects/maq/
Novoalign | http://www.novocraft.com/main/index.php
CLC-Bio Genomics | http://www.clcbio.com/index.php?id=1240
SeqMan NGen | http://www.dnastar.com/t-products-seqman-ngen.aspx
NextGENe | http://softgenetics.com/NextGENe.html
Mosaik | http://bioinformatics.bc.edu/marthlab/Mosaik
SHRiMP | http://compbio.cs.toronto.edu/shrimp/
Mira | http://sourceforge.net/projects/mira-assembler/files/MIRA/stable/
CASAVA | http://www.illumina.com/software/genome analyzer software.ilmn
Newbler | http://www.454.com/products/analysis-software/index.asp
Novoalign | http://www.novocraft.com/main/downloadpage.php
Tablet | http://bioinf.scri.ac.uk/tablet/
SNP-VISTA | http://genome.lbl.gov/vista/snpvista/
Samtools | http://sourceforge.net/projects/samtools/
Savant | http://genomesavant.com/savant/download.php
SOAPsnp | http://soap.genomics.org.cn/soapsnp.html
GATK | http://www.broadinstitute.org/gsa/wiki/index.php/ The Genome Analysis Toolkit
SNVer | http://snver.sourceforge.net/
MaCH | http://www.sph.umich.edu/csg/abecasis/MACH/
IMPUTE2 | http://mathgen.stats.ox.ac.uk/impute/impute v2.html# download impute2
MEGA | http://www.megasoftware.net/
PHYLIP | http://evolution.genetics.washington.edu/phylip.html

The main advantage of SNPs is unquestionably their large numbers. As with all marker systems the researcher must be aware of ascertainment biases that exist in the panel of SNPs being used. These biases exist because SNPs are often developed from examining a small group of individuals and selecting the markers that maximize the amount of polymorphism that can be detected in the population used. This results in a collection of markers that sample only a fraction of the diversity that exists in the species but that are nevertheless used to infer relatedness and determine genetic distance for whole populations. Ideally, a set of SNP markers randomly distributed throughout the genome would be developed for each population studied. GBS moves us closer to this goal by incorporating simultaneous discovery of SNPs and genotyping of individuals. With this approach genome sample bias remains but can be mitigated by careful restriction enzyme selection.

7. Future Perspectives

SNP discovery incontestably made a quantum leap forward with the advent of NGS technologies and large numbers of SNPs are now available from several genomes including large and complex ones (see Section 4). Unlike model systems such as humans and Arabidopsis, SNPs from crop plants remain limited for the time being, but broad access to reasonable cost NGS promises to rapidly increase the production of reference genome sequences as well as SNP discovery. Many issues remain to be addressed, such as the ascertainment bias of popular biparental populations and the low validation rate of some array-based genotyping platforms [145]. The area of epigenetic regulation of various genome components can be better understood as accurate and deeper sequencing is achieved. RNA and ChIP-sequencing projects, similar to RNA-Seq in the nonmodel plant sweet cherry to identify SNPs and haplotypes [146], can be undertaken to study functional genomics. A great deal of knowledge that is still elusive about the noncoding and repetitive elements can be determined with the next wave of modern and efficient sequencing technologies.

The first (Sanger) and the second (next) generation sequencing technologies have enabled researchers to characterize DNA sequence variation, sequence entire genomes, quantify transcript abundance, and understand mechanisms such as alternative splicing and epigenetic regulation [29]. Numerous plant genomes are now sequenced at various levels of completion and many more are underway [72]. The NGS technologies have made SNP discovery affordable even in complex genomes and the technologies themselves have improved tremendously in the past decade. Improvements in TGS promise synergies with NGS technologies to further assist our understanding of plant genetics and genomics. NGS has revolutionized genomics-related research, and it is our belief that the NGS-enabled discoveries will continue in the next decade.

Acknowledgments

The authors are grateful to Andrzej Walichnowski for help with paper editing, Joanne Schiavoni for formatting, and Michael Shillinglaw for figure preparation. This chapter was

Scandinavian wolf population,” Molecular Ecology, vol. 14, no. 2, pp. 503–511, 2005.
[8] C. T. Smith, C. M. Elfstrom, L. W. Seeb, and J. E. Seeb, “Use of sequence data from rainbow trout and Atlantic salmon for SNP detection in Pacific salmon,” Molecular Ecology, vol. 14, no. 13, pp. 4193–4203, 2005.
[9] B. N. Chorley, X. Wang, M. R. Campbell, G. S. Pittman, M. A. Noureddine, and D. A. Bell, “Discovery and verification of functional single nucleotide polymorphisms in regulatory genomic regions: current and developing technologies,” Mutation Research, vol. 659, no. 1-2, pp. 147–157,
[10] K. Faber, K. H. Glatting, P. J. Mueller, A. Risch, and A. Hotz-Wagenblatt, “Genome-wide prediction of splice-modifying SNPs in human genes using a new analysis pipeline called AASsites,” BMC Bioinformatics, vol. 12, supplement 4, article S2, 2011.
[11] S. Atwell, Y. S. Huang, B. J. Vilhjálmsson et al., “Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines,” Nature, vol. 465, no. 7298, pp. 627–631,
[12] W. B. Barbazuk, S. J. Emrich, H. D. Chen, L. Li, and P. S. Schnable, “SNP discovery via 454 transcriptome sequencing,” Plant Journal, vol. 51, no. 5, pp. 910–918, 2007.
[13] A. Ching, K. S. Caldwell, M. Jung et al., “SNP frequency, haplotype structure and linkage disequilibrium in elite maize inbred lines,” BMC Genetics, vol. 3, article 19, 2002.
[14] T. J. Close, P. R. Bhat, S. Lonardi et al., “Development and implementation of high-throughput SNP genotyping in barley,” BMC Genomics, vol. 10, article 582, 2009.
[15] X. Xu, X. Liu, S. Ge et al., “Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes,” Nature Biotechnology, vol. 30, no. 1, pp.
105–111, 2012. [16] S. Kaul, H. L. Koo, J. Jenkins et al., “Analysis of the genome written within the scope of the Genome Canada TUFGEN sequence of the flowering plant Arabidopsis thaliana,” Nature, project, and support from all funding partners is gratefully vol. 408, no. 6814, pp. 796–815, 2000. acknowledged. [17] S. A. Goff, D. Ricke, T. H. Lan et al., “A draft sequence of the rice genome (Oryza sativa L. ssp. japonica),” Science, vol. 296, no. 5565, pp. 92–100, 2002. References [18] J. Yu, S. Hu, J. Wang et al., “A draft sequence of the rice genome (Oryza sativa L. ssp. indica),” Science, vol. 296, no. [1] K. A. Frazer, D. G. Ballinger, D. R. Cox et al., “A second gen- eration human haplotype map of over 3.1 million SNPs,” 5565, pp. 79–92, 2002. [19] J. A. Shendure, G. J. Porreca, and G. M. Church, “Overview Nature, vol. 449, no. 7164, pp. 851–861, 2007. [2] C. H. Brenner and B. S. Weir, “Issues and strategies in of DNA sequencing strategies,” Current Protocols in Molecular Biology, chapter 7, no. 81, pp. 7.1.1–7.1.11, 2008. the DNA identification of World Trade Center victims,” Theoretical Population Biology, vol. 63, no. 3, pp. 173–178, [20] F. Sanger, S. Nicklen, and A. R. Coulson, “DNA sequenc- ing with chain-terminating inhibitors,” Proceedings of the [3] M.I.McCarthy, G. R. Abecasis,L.R.Cardonetal., “Genome- National Academy of Sciences of the United States of America, vol. 74, no. 12, pp. 5463–5467, 1977. wide association studies for complex traits: consensus, uncertainty and challenges,” Nature Reviews Genetics, vol. 9, [21] F. Sanger, G. M. Air, B. G. Barrell et al., “Nucleotide sequence no. 5, pp. 356–369, 2008. of bacteriophage phiX174 DNA,” Nature, vol. 265, no. 5596, pp. 687–695, 1977. [4] Z.J.Liu andJ.F.Cordes, “DNA marker technologies and their applications in aquaculture genetics,” Aquaculture, vol. [22] A. M. Maxam and W. Gilbert, “A new method for sequencing DNA,” Proceedings of the National Academy of Sciences of the 238, no. 1–4, pp. 
1–37, 2004. [5] L.R.Schaeffer, “Strategy for applying genome-wide selection United States of America, vol. 74, no. 2, pp. 560–564, 1977. [23] M. Kircher and J. Kelso, “High-throughput DNA sequen- in dairy cattle,” Journal of Animal Breeding and Genetics, vol. 123, no. 4, pp. 218–223, 2006. cing—concepts and limitations,” BioEssays,vol. 32, no.6,pp. 524–536, 2010. [6] H. Yu, W. Xie, J. Wang et al., “Gains in QTL detection using an ultra-high density SNP map based on population [24] M. Ronaghi, M. Uhlen, ´ and P. Nyren, ´ “A sequencing method based on real-time pyrophosphate,” Science, vol. 281, no. sequencing relative to traditional RFLP/SSR markers,” PLoS ONE, vol. 6, no. 3, Article ID e17595, 2011. 5375, pp. 363–365, 1998. [25] M. Ronaghi, S. Karamohamed, B. Pettersson, M. Uhlen, ´ and [7] J.M.Seddon, H. G. Parker,E.A.Ostrander,and H. Ellegren, “SNPs in ecological and conservation studies: a test in the P. Nyren, “Real-time DNA sequencing using detection of 12 International Journal of Plant Genomics pyrophosphate release,” Analytical Biochemistry, vol. 242, no. [44] M. Trick, Y. Long,J.Meng, andI.Bancroft, “Single nucleotide 1, pp. 84–89, 1996. polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing,” Plant Biotech- [26] T. C. Glenn, “Field guide to next-generation DNA sequen- cers,” Molecular Ecology Resources, vol. 11, no. 5, pp. 759–769, nology Journal, vol. 7, no. 4, pp. 334–346, 2009. 2011. [45] S. S. Yang, Z. J. Tu, F. Cheung et al., “Using RNA-Seq for gene identification, polymorphism detection and transcript [27] G. Turcatti, A. Romieu, M. Fedurco, and A. P. Tairi, “A new class of cleavable fluorescent nucleotides: synthesis and profiling in two alfalfa genotypes with divergent cell wall composition in stems,” BMC Genomics, vol. 12, no. 1, article optimization as reversible terminators for DNA sequencing by synthesis,” Nucleic Acids Research, vol. 36, no. 4, article e25, 199, 2011. 2008. [46] F. Ozsolak, D. T. 
Ting, B. S. Wittner et al., “Amplification-free digital gene expression profiling from minute cell quantities,” [28] J. Shendure,G.J.Porreca,N.B.Reppasetal., “Molecular biology: accurate multiplex polony sequencing of an evolved Nature Methods, vol. 7, no. 8, pp. 619–621, 2010. bacterial genome,” Science, vol. 309, no. 5741, pp. 1728–1732, [47] Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: a revolu- 2005. tionary tool for transcriptomics,” Nature Reviews Genetics, [29] E. E. Schadt, S. Turner, and A. Kasarskis, “A window into vol. 10, no. 1, pp. 57–63, 2009. third-generation sequencing,” Human Molecular Genetics, [48] H. Xu, Y. Gao, and J. Wang, “Transcriptomic analysis of vol. 19, no. 2, pp. R227–R240, 2010. rice (Oryza sativa) developing embryos using the RNA-Seq [30] T. D. Harris, P. R. Buzby, H. Babcock et al., “Single-molecule technique,” PLoS ONE, vol. 7, no. 2, Article ID e30646, 2012. DNA sequencing of a viral genome,” Science, vol. 320, no. [49] J. D. Roberts, B. D. Preston, L. A. Johnston, A. Soni, L. A. 5872, pp. 106–109, 2008. Loeb, and T. A. Kunkel, “Fidelity of two retroviral reverse [31] C. S. Pareek, R. Smoczynski, and A. Tretyn, “Sequencing transcriptases during DNA-dependent DNA synthesis in vitro,” Molecular and Cellular Biology, vol. 9, no. 2, pp. 469– technologies and genome sequencing,” Journal of Applied Genetics, vol. 52, no. 4, pp. 413–435, 2011. 476, 1989. [32] J. Eid, A. Fehr, J. Gray et al., “Real-time DNA sequencing [50] U. Gubler, “Second-strand cDNA synthesis: mRNA frag- mentsasprimers,” Methods in Enzymology, vol. 152, pp. 330– from single polymerase molecules,” Science, vol. 323, no. 5910, pp. 133–138, 2009. 335, 1987. [33] S. Koren, M. C. Schatz, B. P. Walenz et al., “Hybrid error [51] J. Cocquet, A. Chong, G. Zhang, and R. A. Veitia, “Reverse transcriptase template switching and false alternative tran- correction and de novo assembly of single-molecule sequenc- ing reads,” Nature Biotechnology, vol. 30, no. 7, pp. 
693–700, scripts,” Genomics, vol. 88, no. 1, pp. 127–131, 2006. 2012. [52] F. Ozsolak, A. R. Platt, D. R. Jones et al., “Direct RNA sequencing,” Nature, vol. 461, no. 7265, pp. 814–818, 2009. [34] F. Ribeiro, D. Przybylski, S. Yin et al., “Finished bacterial genomes from shotgun sequence data,” Genome Research.In [53] M. J. Solomon, P. L. Larsen, and A. Varshavsky, “Mapping press. protein-DNA interactions in vivo with formaldehyde: evi- dence that histone H4 is retained on a highly transcribed [35] A. Bashir, A. A. Klammer, W. P. Robins et al., “A hybrid approach for the automated finishing of bacterial genomes,” gene,” Cell, vol. 53, no. 6, pp. 937–947, 1988. Nature Biotechnology, vol. 30, no. 7, pp. 701–707, 2012. [54] T. S. Mikkelsen, M. Ku, D. B. Jaffe et al., “Genome-wide maps of chromatin state in pluripotent and lineage-committed [36] X. Zhang, K. W. Davenport, W. Gu et al., “Improving genome assemblies by sequencing PCR products with PacBio,” cells,” Nature, vol. 448, no. 7153, pp. 553–560, 2007. BioTechniques, vol. 53, no. 1, pp. 61–62, 2012. [55] P. Ng,J.J.Tan,H.S.Ooi et al., “Multiplex sequencing of paired-end ditags (MS-PET): a strategy for the ultra-high- [37] P. Kothiyal,S.Cox,J.Ebert,B.J.Aronow, J. H. Greinwald, and H. L. Rehm, “An overview of custom array sequencing,” throughput analysis of transcriptomes and genomes,” Nucleic Acids Research, vol. 34, no. 12, p. e84, 2006. Current Protocols in Human Genetics, no. 61, chapter 7, pp. 7.17.1–17.17.11, 2009. [56] G. Robertson, M. Hirst, M. Bainbridge et al., “Genome- [38] J. D. McPherson, “Next-generation gap,” Nature Methods, wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing,” vol. 6, no. 11, supplement, pp. S2–S5, 2009. Nature Methods, vol. 4, no. 8, pp. 651–657, 2007. [39] A. P. M. Weber, K. L. Weber, K. Carr, C. Wilkerson, and J. B. Ohlrogge, “Sampling the arabidopsis transcriptome with [57] B. Giardine, C. Riemer, R. C. 
Hardison et al., “Galaxy: a platform for interactive large-scale genome analysis,” Genome massively parallel pyrosequencing,” Plant Physiology, vol. 144, no. 1, pp. 32–42, 2007. Research, vol. 15, no. 10, pp. 1451–1455, 2005. [40] M. Libault, A. Farmer, T. Joshi et al., “An integrated [58] W. Wang, Z. Wei, T.-W. Lam, and J. Wang, “Next generation sequencing has lower sequence coverage and poorer SNP- transcriptome atlas of the crop model Glycine max, and its use in comparative analyses in plants,” Plant Journal, vol. 63, detection capability in the regulatory regions,” Scientific no. 1, pp. 86–99, 2010. Reports, vol. 1, article 55, 2011. [41] T. Lu, G. Lu, D. Fan et al., “Function annotation of the rice [59] B. Langmead,C.Trapnell, M. Pop, andS.L.Salzberg, transcriptome at single-nucleotide resolution by RNA-seq,” “Ultrafast and memory-efficient alignment of short DNA Genome Research, vol. 20, no. 9, pp. 1238–1249, 2010. sequences to the human genome,” Genome Biology, vol. 10, [42] W. B. Barbazuk, S. Emrich, and P. S. Schnable, “SNP no. 3, article R25, 2009. mining from maize 454 EST sequences,” Cold Spring Harbor [60] H. Li and R. Durbin, “Fast and accurate short read alignment Protocols. In press. with Burrows-Wheeler transform,” Bioinformatics, vol. 25, [43] E. Novaes,D.R.Drost,W.G.Farmerieetal., “High- no. 14, pp. 1754–1760, 2009. throughput gene and SNP discovery in Eucalyptus grandis,an [61] R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool uncharacterized genome,” BMC Genomics, vol. 9, article 312, for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 2008. 1966–1967, 2009. International Journal of Plant Genomics 13 [62] H. Li and N. Homer, “A survey of sequence alignment Drosophila,” Nature Genetics, vol. 29, no. 4, pp. 475–481, algorithms for next-generation sequencing,” Briefings in 2001. Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, [81] A. M. Allen, G. L. Barker, S. T. Berry et al., “Transcript-spe- 2010. 
cific, single-nucleotide polymorphism discovery and linkage [63] T. J. Treangen and S. L. Salzberg, “Repetitive DNA and next- analysis in hexaploid bread wheat (Triticum aestivum L.),” generation sequencing: computational challenges and solu- Plant Biotechnology Journal, vol. 9, no. 9, pp. 1086–1099, tions,” Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2011. [82] D. Trebbi, M. Maccaferri, P. de Heer et al., “High-throughput [64] R. McLendon, A. Friedman, D. Bigner et al., “Comprehensive SNP discovery and genotyping in durum wheat (Triticum genomic characterization defines human glioblastoma genes durum Desf.),” Theoretical and Applied Genetics, vol. 123, no. and core pathways,” Nature, vol. 455, no. 7216, pp. 1061– 4, pp. 555–569, 2011. 1068, 2008. [83] L. Barchi, S. Lanteri, E. Portis et al., “Identification of SNP [65] S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, and SSR markers in eggplant using RAD tag sequencing,” and M. Brudno, “SHRiMP: accurate mapping of short color- BMC Genomics, vol. 12, article 304, 2011. space reads,” PLoS Computational Biology, vol. 5, no. 5, [84] F. A. Feltus,J.Wan,S.R.Schulze,J.C.Estill, N. Jiang, andA. Article ID e1000386, 2009. H. Paterson, “An SNP resource for rice genetics and breeding [66] S. Rounsley, P. R. Marri, Y. Yu et al., “De novo next generation based on subspecies Indica and Japonica genome alignments,” sequencing of plant genomes,” Rice, vol. 2, no. 1, pp. 35–43, Genome Research, vol. 14, no. 9, pp. 1812–1819, 2004. [85] K. L. McNally, K. L. Childs, R. Bohnert et al., “Genomewide [67] T. Sasaki,“Themap-based sequence of thericegenome,” SNP variation reveals relationships among landraces and Nature, vol. 436, no. 7052, pp. 793–800, 2005. modern varieties of rice,” Proceedings of the National Academy [68] E. Pennisi, “Plant sciences: corn genomics pops wide open,” of Sciences of the United States of America, vol. 106, no. 30, pp. Science, vol. 319, no. 5868, p. 1333, 2008. 12273–12278, 2009. [69] G. 
A. Tuskan, S. DiFazio, S. Jansson et al., “The genome [86] T. Yamamoto, H. Nagasaki, J. I. Yonemaru et al., “Fine of black cottonwood, Populus trichocarpa (Torr. & Gray),” definition of the pedigree haplotypes of closely related rice Science, vol. 313, no. 5793, pp. 1596–1604, 2006. cultivars by means of genome-wide discovery of single- [70] O. Jaillon, J. M. Aury, B. Noel et al., “The grapevine genome nucleotide polymorphisms,” BMC Genomics,vol. 11, no.1, article 267, 2010. sequence suggests ancestral hexaploidization in major angi- osperm phyla,” Nature, vol. 449, no. 7161, pp. 463–467, 2007. [87] G. Jander,S.R.Norris, S. D. Rounsley, D. F. Bush,I.M.Levin, [71] A. H. Paterson, J. E. Bowers, R. Bruggmann et al., “The and R. L. Last, “Arabidopsis map-based cloning in the post- genome era,” Plant Physiology, vol. 129, no. 2, pp. 440–450, Sorghum bicolor genome and the diversification of grasses,” Nature, vol. 457, no. 7229, pp. 551–556, 2009. 2002. [72] C. Feuillet, J. E. Leach, J. Rogers, P. S. Schnable, and K. [88] X. Zhang and J. O. Borevitz, “Global analysis of allele-specific expression in Arabidopsis thaliana,” Genetics, vol. 182, no. 4, Eversole, “Crop genome sequencing: lessons and rationales,” Trends in Plant Science, vol. 16, no. 2, pp. 77–88, 2011. pp. 943–954, 2009. [73] B. Chevreux, T. Pfisterer, B. Drescher et al., “Using the [89] R. Waugh, J. L. Jannink, G. J. Muehlbauer, and L. Ramsay, “The emergence of whole genome association scans in miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs,” barley,” Current Opinion in Plant Biology,vol. 12, no.2,pp. 218–222, 2009. Genome Research, vol. 14, no. 6, pp. 1147–1159, 2004. [90] J. C. Nelson, S. Wang, Y. Wu et al., “Single-nucleotide poly- [74] R. Li, Y. Li, X. Fang et al., “SNP detection for massively parallel whole-genome resequencing,” Genome Research, vol. morphism discovery by high-throughput sequencing in sorghum,” BMC Genomics, vol. 
12, article 352, 2011. 19, no. 6, pp. 1124–1132, 2009. [91] R. L. Byers, D. B. Harker,S.M.Yourstone,P.J.Maughan, [75] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. M. Jones, and I. Birol, “ABySS: a parallel assembler for short read and J. A. Udall, “Development and mapping of SNP assays in allotetraploid cotton,” Theoretical and Applied Genetics, vol. sequence data,” Genome Research, vol. 19, no. 6, pp. 1117– 1123, 2009. 124, no. 7, pp. 1201–1214, 2012. [76] D. R. Zerbino and E. Birney, “Velvet: algorithms for de novo [92] D. L. Hyten, S. B. Cannon, Q. Song et al., “High-throughput SNP discovery through deep resequencing of a reduced short read assembly using de Bruijn graphs,” Genome Research, vol. 18, no. 5, pp. 821–829, 2008. representation library to anchor and orient scaffolds in the soybean whole genome sequence,” BMC Genomics, vol. 11, [77] D. Chagne, ´ R. N. Crowhurst, M. Troggio et al., “Genome- no. 1, article 38, 2010. wide SNP detection, validation, and development of an 8K SNP array for apple,” PLoS ONE, vol. 7, no. 2, Article ID [93] J. P. Hamilton, C. N. Hansey, B. R. Whitty et al., “Single e31745, 2012. nucleotide polymorphism discovery in elite north American [78] A. J. Corte´s,M.C.Chavarro, andM.W.Blair,“SNP potato germplasm,” BMC Genomics, vol. 12, article 302, 2011. marker diversity in common bean (Phaseolus vulgaris L.),” [94] Y.-B. Fu and G. W. Peterson, “Developing genomic resources Theoretical and Applied Genetics, vol. 123, no. 5, pp. 827–845, in two Linum species via 454 pyrosequencing and genomic 2011. reduction,” Molecular Ecology Resources,vol. 12, no.3,pp. 492–500, 2012. [79] D. Altshuler, V. J. Pollara, C. R. Cowles et al., “An SNP map of the human genome generated by reduced representation [95] F. M. You, N. Huo, K. R. Deal et al., “Annotation-based shotgun sequencing,” Nature, vol. 407, no. 6803, pp. 513–516, genome-wide SNP discovery in the large and complex 2000. 
genome using next-generation sequencing Aegilops tauschii [80] J. Berger, T. Suzuki, K. A. Senti, J. Stubbs, G. Schaffner, without a reference genome sequence,” BMC Genomics, vol. 12, article 59, 2011. and B. J. Dickson, “Genetic mapping with SNP markers in 14 International Journal of Plant Genomics [96] Y. Han, Y. Kang, I. Torres-Jerez et al., “Genome-wide SNP [114] S. Giancola, H. I. McKhann, A. Ber ´ ard et al., “Utilization discovery in tetraploid alfalfa using 454 sequencing and high of the three high-throughput SNP genotyping methods, resolution melting analysis,” BMC Genomics, vol. 12, p. 350, the GOOD assay, Amplifluor and TaqMan, in diploid and 2011. polyploid plants,” Theoretical and Applied Genetics, vol. 112, no. 6, pp. 1115–1124, 2006. [97] R. E. Oliver, G. R. Lazo, J. D. Lutz et al., “Model SNP devel- opment for complex genomes based on hexaploid oat [115] S. Kim and A. Misra, “SNP genotyping: technologies and biomedical applications,” Annual Review of Biomedical Engi- using high-throughput 454 sequencing technology,” BMC Genomics, vol. 12, no. 1, article 77, 2011. neering, vol. 9, pp. 289–320, 2007. [116] P. K. Gupta, S. Rustgi, and R. R. Mir, “Array-based high- [98] E. Jones, W. C. Chu, M. Ayele et al., “Development of throughput DNA markers for crop improvement,” Heredity, single nucleotide polymorphism (SNP) markers for use in vol. 101, no. 1, pp. 5–18, 2008. commercial maize (Zea mays L.) germplasm,” Molecular [117] J. Ragoussis, “Genotyping technologies for genetic research,” Breeding, vol. 24, no. 2, pp. 165–176, 2009. Annual Review of Genomics and Human Genetics, vol. 10, pp. [99] S. Ossowski, K. Schneeberger, R. M. Clark, C. Lanz, N. 117–133, 2009. Warthmann, and D. Weigel, “Sequencing of natural strains [118] J. W. Davey, P. A. Hohenlohe, P. D. Etter, J. Q. Boone, J. M. of Arabidopsis thaliana with short reads,” Genome Research, Catchen, and M. L. Blaxter, “Genome-wide genetic marker vol. 18, no. 12, pp. 2024–2033, 2008. 
discovery and genotyping using next-generation sequenc- [100] I. Milne, M. Bayer, L. Cardle et al., “Tablet-next generation ing,” Nature Reviews Genetics, vol. 12, no. 7, pp. 499–510, sequence assembly visualization,” Bioinformatics, vol. 26, no. 3, pp. 401–402, 2009. [119] R. J. Elshire, J. C. Glaubitz, Q. Sun et al., “A robust, simple [101] N. Shah, M. V. Teplitsky, S. Minovitsky et al., “SNP-VISTA: genotyping-by-sequencing (GBS) approach for high diver- an interactive SNP visualization tool,” BMC Bioinformatics, sity species,” PLoS ONE, vol. 6, no. 5, Article ID e19379, 2011. vol. 6, no. 1, article 292, 2005. [120] J. A. Poland, P. J. Brown, M. E. Sorrells, and J.-L. Jannink, [102] M. Fiume, V. Williams, A. Brook, and M. Brudno, “Savant: “Development of high-density genetic maps for barley and genome browser for high-throughput sequencing data,” wheat using a novel two-enzyme genotyping-by-sequencing Bioinformatics, vol. 26, no. 16, Article ID btq332, pp. 1938– approach,” PLoS ONE, vol. 7, no. 2, Article ID e32253, 2012. 1944, 2010. [121] M. W. Horton, A. M. Hancock, Y. S. Huang et al., “Genome- [103] H. Li, B. Handsaker, A. Wysoker et al., “The sequence wide patterns of genetic variation in worldwide Arabidopsis alignment/map format and SAMtools,” Bioinformatics, vol. thaliana accessions from the RegMap panel,” Nature Genetics, 25, no. 16, pp. 2078–2079, 2009. vol. 44, no. 2, pp. 212–216, 2012. [104] Z. Wei, W. Wang, P. Hu, G. J. Lyon, and H. Hakonarson, [122] G. K. Subbaiyan, D. L. E. Waters, S. K. Katiyar, A. R. “SNVer: a statistical tool for variant calling in analysis Sadananda, S. Vaddadi, and R. J. Henry, “Genome-wide DNA of pooled or individual next-generation sequencing data,” polymorphisms in elite indica rice inbreds discovered by Nucleic acids research, vol. 39, no. 19, article e132, 2011. whole-genome sequencing,” Plant Biotechnology Journal, vol. [105] R. Nielsen, J. S. Paul, A. Albrechtsen, and Y. S. Song, “Geno- 10, no. 6, pp. 623–634, 2012. 
type and SNP calling from next-generation sequencing data,” [123] J. C. Nelson, “Methods and software for genetic mapping,” in Nature Reviews Genetics, vol. 12, no. 6, pp. 443–451, 2011. The Handbook of Plant Genome Mapping, pp. 53–74, Wiley- [106] R. Ragupathy, R. Rathinavelu, and S. Cloutier, “Physical VCH, Weinheim, Germany, 2005. mapping and BAC-end sequence analysis provide initial [124] A. Rafalski, “Applications of single nucleotide polymor- insights into the flax (Linum usitatissimum L.) genome,” phisms in crop genetics,” Current Opinion in Plant Biology, BMC Genomics, vol. 12, article 217, 2011. vol. 5, no. 2, pp. 94–100, 2002. [107] A. McKenna, M. Hanna, E. Banks et al., “The genome anal- [125] L. Kruglyak, “The use of a genetic map of biallelic markers ysis toolkit: a MapReduce framework for analyzing next- in linkage studies,” Nature Genetics, vol. 17, no. 1, pp. 21–24, generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010. [126] W. Xie, Q. Feng, H. Yu et al., “Parent-independent geno- [108] G. Lunter and M. Goodson, “Stampy: a statistical algorithm typing for constructing an ultrahigh-density linkage map for sensitive and fast mapping of Illumina sequence reads,” based on population sequencing,” Proceedings of the National Genome Research, vol. 21, no. 6, pp. 936–939, 2011. Academy of Sciences of the United States of America, vol. 107, [109] R. M. Durbin, “A map of human genome variation from no. 23, pp. 10578–10583, 2010. population-scale sequencing,” Nature, vol. 467, no. 7319, pp. [127] F. Li, H. Kitashiba, K. Inaba, and T. Nishio, “A Brassica rapa 1061–1073, 2010. linkage map of EST-based SNP markers for identification [110] J. B. Fan, M. S. Chee, and K. L. Gunderson, “Highly parallel of candidate genes controlling flowering time and leaf genomic assays,” Nature Reviews Genetics, vol. 7, no. 8, pp. morphological traits,” DNA Research, vol. 16, no. 6, pp. 311– 632–644, 2006. 323, 2009. [111] M. R. Garvin, K. 
Saitoh, and A. J. Gharrett, “Application of [128] E. S. Buckler, J. B. Holland, P. J. Bradbury et al., “The genetic single nucleotide polymorphisms to non-model species: a architecture of maize flowering time,” Science, vol. 325, no. technical review,” Molecular Ecology Resources, vol. 10, no. 6, 5941, pp. 714–718, 2009. pp. 915–934, 2010. [129] S. A. Flint-Garcia, J. M. Thornsberry, and S. B. Edward, [112] Z. Tsuchihashi and N. C. Dracopoli, “Progress in high “Structure of linkage disequilibrium in plants,” Annual throughput SNP genotyping methods,” Pharmacogenomics Review of Plant Biology, vol. 54, pp. 357–374, 2003. Journal, vol. 2, no. 2, pp. 103–110, 2002. [130] P. K. Gupta, S. Rustgi, and P. L. Kulwal, “Linkage disequilib- [113] B. Sobrino and A. Carracedo, “SNP typing in forensic rium and association studies in higher plants: present status genetics: a review,” Methods in Molecular Biology, vol. 297, and future prospects,” Plant Molecular Biology, vol. 57, no. 4, pp. 107–126, 2005. pp. 461–485, 2005. International Journal of Plant Genomics 15 [131] M. J. Aranzana, S. Kim, K. Zhao et al., “Genome-wide association mapping in Arabidopsis identifies previously known flowering time and pathogen resistance genes,” PLoS Genetics, vol. 1, no. 5, p. e60, 2005. [132] X. Huang, X. Wei, T. Sang et al., “Genome-wide asociation studies of 14 agronomic traits in rice landraces,” Nature Genetics, vol. 42, no. 11, pp. 961–967, 2010. [133] K. L. Kump, P. J. Bradbury, R. J. Wisser et al., “Genome-wide association study of quantitative resistance to southern leaf blight in the maize nested association mapping population,” Nature Genetics, vol. 43, no. 2, pp. 163–168, 2011. [134] J. A. Poland, P. J. Bradbury, E. S. Buckler, and R. J. Nelson, “Genome-wide nested association mapping of quantitative resistance to northern leaf blight in maize,” Proceedings of the National Academy of Sciences of the United States of America, vol. 108, no. 17, pp. 6893–6898, 2011. [135] F. Tian, P. 
J. Bradbury, P. J. Brown et al., “Genome-wide association study of leaf architecture in the maize nested association mapping population,” Nature Genetics, vol. 43, no. 2, pp. 159–162, 2011. [136] R. K. Pasam, R. Sharma, M. Malosetti et al., “Genome-wide association studies for agronomical traits in a world wide spring barley collection,” BMC Plant Biology, vol. 12, article 16, 2012. [137] B. J. Soto-Cerda and S. Cloutier, “Association mapping in plant genomes,” in Genetic Diversity in Plants,M.C¸alis¸kan, Ed., pp. 29–54, InTech, 2012. [138] P. A. Morin, G. Luikart, and R. K. Wayne, “SNPs in ecology, evolution and conservation,” Trends in Ecology and Evolution, vol. 19, no. 4, pp. 208–216, 2004. [139] P. W. Hedrick, “Perspective: highly variable loci and their interpretation in evolution and conservation,” Evolution, vol. 53, no. 2, pp. 313–318, 1999. [140] A. Vignal, D. Milan, M. SanCristobal, and A. Eggen, “A review on SNP and other types of molecular markers and their use in animal genetics,” Genetics Selection Evolution, vol. 34, no. 3, pp. 275–305, 2002. [141] S. Konishi, T. Izawa, S. Y. Lin et al., “An SNP caused loss of seed shattering during rice domestication,” Science, vol. 312, no. 5778, pp. 1392–1396, 2006. [142] O. Wei, Z. Peng, Y. Zhou, Z. Yang, K. Wu, and Z. Ouyang, “Nucleotide diversity and molecular evolution of the WAG- 2geneincommonwheat (Triticum aestivum L.) and its relatives,” Genetics and Molecular Biology,vol. 34, no.4,pp. 606–615, 2011. [143] J. D. Retief, “Phylogenetic analysis using PHYLIP,” Methods in Molecular Biology, vol. 132, pp. 243–258, 2000. [144] K. Tamura, J. Dudley, M. Nei, and S. Kumar, “MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0,” Molecular Biology and Evolution, vol. 24, no. 8, pp. 1596–1599, 2007. [145] M. W. Ganal, T. Altmann, and M. S. Roder ¨ , “SNP identifica- tion in crop plants,” Current Opinion in Plant Biology, vol. 12, no. 2, pp. 211–217, 2009. [146] T. Koepke, S. Schaeffer, V. 
Krishnan et al., “Rapid gene-based SNP and haplotype marker development in non-model eukaryotes using 3’UTR sequencing,” BMC Genomics, vol. 13, no. 1, article 18, 2012.

Journal

International Journal of Plant Genomics, Hindawi Publishing Corporation

Published: Nov 22, 2012
