Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph

Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore, is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) only successfully ran on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/Rmapper.
Keywords: Optical mapping, Single molecule maps, de Bruijn graph, Overlap-layout-consensus, Genome assembly, Mis-assemblies

Mukherjee et al. Algorithms Mol Biol (2021) 16:6

*Correspondence: kingdgp@ufl.edu. Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, USA. Full list of author information is available at the end of the article.

© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Introduction

In 1993 Schwartz et al. developed optical mapping [1], a system for creating an ordered, genome wide high resolution restriction map of a given organism's genome. Since this initial development, genome wide optical maps have found numerous applications including discovering structural variations [2, 3], scaffolding and validating contigs for several large sequencing projects [4, 5], and detecting mis-assembled regions in draft genomes [6–8]. Thus, optical mapping has assisted in the assembly of a variety of species, including various prokaryotic species [9–11], rice [12], maize [13], mouse [14], goat [15], parrot [4], and Amborella trichopoda [5]. Bionano Genomics has enabled the automated generation of the data, allowing the data to become more widespread. For example, Bionano data was generated for 133 species sequenced for the Vertebrate Genomes Project.

Similar to sequencing, the protocol for producing optical mapping data begins with many fragmented copies of the genome of interest. This redundancy allows overlap between the raw data and assembly into longer contiguous regions corresponding to the genome. With a selected enzyme, the genomic DNA fragments are nicked at each restriction site recognized by the enzyme. These cleaved fragments are then photographed and analyzed in order to determine the length (in kbp) of the regions between nick sites. The results of this process are optical maps for all the fragments, which are referred to as Rmaps. For example, given a genome fragment TTTTAACTGGGGGGGAACTTTTTTTTAACTTTTT and an enzyme that recognizes the site AACT and cleaves in the middle, the resulting Rmap would be [6, 11, 11, 6]. Rmaps by themselves are not traditionally used for analysis (although they can be [2, 3, 16]) and instead have to be assembled into longer contiguous optical maps corresponding to the genome. Hence, assembly of Rmaps refers to the problem of generating a consensus genome wide optical map from overlapping Rmaps.

Although optical mapping has been around for several decades, the problem of efficiently assembling the data largely remains open as there has been little work in this area, which is largely due to the challenges posed by the data itself. We should note that several related problems, such as alignment of optical mapping data [16–22], have been more thoroughly explored. Rmap data has a number of errors that make it difficult to assemble: there exist added and deleted cut sites and sizing error, resulting in extra fragments, merges of neighboring fragments, and under- or over-estimates of the length of a fragment. In the running example, the error free Rmap [6, 11, 11, 6] could occur as [6, 22, 6] with error. Nonetheless, there exist two Rmap assembly methods: Gentig by Anantharaman et al. [23] and the assembler of Valouev et al. [24]. Developed in 1998, Gentig is the first Rmap assembly algorithm. It is based on a Bayesian model that seeks to maximize the a posteriori estimate of the consensus optical map produced by the assembly of Rmaps. It first computes the overlap between all pairs of Rmaps using dynamic programming, and then builds contigs by greedily merging the Rmaps based on alignment score. This process of merging contigs continues until all alignments above a certain score are merged. Valouev et al. [24] implemented an overlap-layout-consensus (OLC) assembly algorithm using their alignment algorithm [25], which also starts by calculating alignments between all pairs of Rmaps and identifying all alignments that have a score above a specified threshold. A graph is built, where Rmaps are represented as nodes, and the non-filtered alignments are represented as edges. The graph is refined by eliminating paths in the graph that are weakly supported. In other words, if two connected regions in the graph are joined by only a single path (or by multiple paths that share one or more intermediate nodes), then the graph is disconnected at these nodes. Further, an edge is removed if it is inconsistent with a higher scoring edge. Contigs are then generated by traversing this graph in a depth first manner. Bionano Genomics Inc. provides a proprietary assembly method, called Bionano Solve; however, the source code is not publicly available and the algorithmic details are unknown due to the proprietary nature of the software.

The alternative to an OLC approach for assembly is a de Bruijn graph approach that relies on building and traversing a de Bruijn graph constructed on the sequence data. For simplicity, we give a constructive definition of the de Bruijn graph in the context of genome assembly. Given a set of sequences $R = \{r_1, \ldots, r_n\}$ and an integer $k$, the de Bruijn graph is constructed by creating a directed edge for each unique length-$k$ substring (k-mer), with the nodes labeled as the length-$(k-1)$ prefix and length-$(k-1)$ suffix of the k-mer, and then merging all nodes that have the same label. The important aspect of the de Bruijn graph assembly approach is that it avoids having to find alignments between any pair of sequences, leading to an $O(n)$ run-time. Since its introduction by Idury et al. [26] and Pevzner et al. [27], this approach has become the most common paradigm for assembling short read sequencing data because it led to huge gains in performance over OLC approaches. Hence, applying a de Bruijn graph approach to Rmap assembly would likely lead to similar improvements by removing the burden of finding all pairwise alignments between Rmaps. This kind of assembly works on the premise that a k-mer will frequently occur exactly, without error, in the data. Hence, the biggest challenge we face is constructing a de Bruijn graph in the presence of added and deleted cut-sites and sizing error. Even without the occurrence of added and deleted cut-sites, k-mers created from Rmap data are unlikely to be exact replicas due to sizing error. For example, [6, 11, 11, 6] and [5, 10, 11, 7] should likely be recognized as instances of the same k-mer in Rmap data. Thus, to overcome this challenge the de Bruijn graph has to be redefined to account for the inexactness of the data.

In this paper, we formulate and describe a de Bruijn graph approach for de novo Rmap assembly, which heavily relies on redefining the de Bruijn graph to make it suitable for Rmap data. We accomplish this by extending the definition of a bi-label in the context of the paired de Bruijn graph that was introduced by Medvedev et al. [28]. We refer to our modified de Bruijn graph as the bi-labelled de Bruijn graph. Next, we demonstrate how to efficiently build and store the de Bruijn graph using a two tier orthogonal-range search data structure. We implement this approach, leading to a novel Rmap assembler that we call rmapper. We compare the performance of our method with the assembler of Valouev et al. and Bionano Solve on three genomes of varying size: E. coli, human, and climbing perch (a fish species from the Vertebrate Genomes Project). Our comparison demonstrates that rmapper was more than 130 times faster and used five times less memory than Solve, and was more than 2,000 times faster than Valouev et al. Also, rmapper successfully assembled the 3.1 million Rmaps of the climbing perch genome into contigs that covered over 95% of the draft genome with zero mis-assemblies.

Background and definitions

Rmap data and genome wide optical maps

From a computer science perspective, we can view an Rmap $R = [r_1, r_2, \ldots, r_{|R|}]$ as an ordered list of integers. Each number represents the length of the respective fragment. The size of an Rmap $R$ is the number of fragments in $R$, which we denote as $|R|$. For example, say we have an enzyme that cleaves the DNA at the middle position of AACT and a genomic sequence TTTTAACTGGGGGGGAACTTTTTTTTAACTTTTT; then the Rmap will be $R = [6, 11, 11, 6]$, corresponding to the cleaved sequences [TTTTAA, CTGGGGGGGAA, CTTTTTTTTAA, CTTTTT].

Error profile of Rmap data

There are three types of errors that can occur in optical mapping: (1) missing cut sites, which are caused by an enzyme not cleaving at a specific site, (2) additional cut sites, which can occur due to random DNA breakage, and (3) inaccuracy in the fragment size due to the inability of the system to accurately estimate the fragment size. Continuing with the example above, an example of an additional cut site would be when the second fragment of $R$ is split into two, e.g., [6, 5, 6, 11, 6], and an example of a missing cut site would be when the last two fragments of $R$ are joined into a single fragment, e.g., [6, 11, 17]. Lastly, an example of a sizing error would be if the size of the first fragment is estimated to be 7 rather than 6.

Several different probabilistic models have been proposed for describing the sizing error and the frequency of added and missed cut-sites, including the models of Valouev et al. [25], Li et al. [29], and Chen et al. [30]. We briefly describe these models here but refer to the original papers for a full description. Both Valouev et al. and Chen et al. describe the observed fragment lengths as a normal distribution with the mean being equal to the true length of the fragment and the standard deviation being a function of the true length, i.e., longer fragments exhibit larger standard deviation. In the model by Li et al. the sizing error uses a Laplace distribution as follows: if the observed and actual sizes of a fragment are $o_i$ and $r_i$, respectively, then $o_i \sim r_i \times \mathrm{Laplace}(\mu, \beta)$, where $\mu$ and $\beta$ are parameters of the Laplace distribution and are functions of $r_i$. All studies model the probability of having a missed cut-site as a Bernoulli trial. Valouev et al. and Chen et al. predict a fixed probability for digestion of a cut-site, while Li et al. model the probability of digestion as a function of the lengths of the fragments flanking the cut-site. The likelihood of a missed cut-site decreases with the length of the fragment. All three models postulate that additional or false cut-sites result from random breaks of the DNA molecule and hence model the number of false cuts per unit length of DNA as a Poisson distribution. Li et al. observed that false cuts occurred less frequently at the two ends of an Rmap.

Rmap segments and k-mers

We define a segment $s_{p,q}$ of an Rmap starting at position $p$ and ending at position $q$ as the $q - p + 1$ consecutive fragments starting from $r_p$, i.e., $[r_p, r_{p+1}, \ldots, r_q]$. We define the length of a segment as the sum of all of its constituent fragments, i.e., $r_p + \cdots + r_q$. We denote the length of a segment $s_{p,q}$ as $\ell(s_{p,q})$. We note that the length of the Rmap $R$ should not be confused with the number of fragments, which we denote as its size $|R|$.

In this paper, we extend the definition of a k-mer to the context of Rmap data as follows. Given an integer $k$, we define a k-mer as a segment of exactly $k$ fragments, i.e., a sequence of $k$ successive fragments of an Rmap. Following the example from above, the following two 3-mers exist in $R = [6, 11, 11, 6]$: [6, 11, 11] and [11, 11, 6].

Prefixes and suffixes of Rmaps

Given an Rmap $R = [r_1, r_2, \ldots, r_{|R|}]$, we define the x-size prefix of $R$ as $[r_1, r_2, \ldots, r_x]$, where $x$ is at most $|R| - 1$. Conversely, we define the x-size suffix of $R$ as $[r_{|R|-x+1}, \ldots, r_{|R|}]$, where $x$ is at most $|R| - 1$.

The Bi-labelled de Bruijn graph

In this section, we modify the traditional definition of the de Bruijn graph for Rmap data by first redefining the concept of a bi-label for Rmap data. The term bi-label was first introduced by Medvedev et al. [28] in the context of short read assembly to incorporate mate-pair data into assembly of paired-end reads. There the term bi-label refers to two k-mers separated by a specified genomic distance. The redefinition of the de Bruijn graph with this extra information was shown to de-tangle the resulting graph, making traversal more efficient and accurate. Here, we demonstrate that an equivalent paradigm can be effective for Rmap assembly.

Bi-labels

Given integers $k$ and $D$, and an Rmap $R$, we define a bi-label from $R$ as a segment of $R$ containing a pair of k-mers separated by the shortest segment that has a length of at least $D$. The following is a formal definition.

Definition 1 Given an Rmap $R = [r_1, r_2, \ldots, r_i, r_{i+1}, \ldots, r_{|R|}]$, integers $k$ and $D$, and a position $i$, we define the bi-label at position $i$ to be $[s_k^1, r_p, \ldots, r_q, s_k^2]$, where $p = i + k$ and $q$ is an index such that $\ell(s_{p,q-1}) < D \le \ell(s_{p,q})$, and $s_k^1$ and $s_k^2$ are the k-mers starting at positions $i$ and $q + 1$, respectively.

Next, we refer to the segment $s_{p,q}$ between $s_k^1$ and $s_k^2$ as the skip segment, and note that, unlike $s_k^1$ and $s_k^2$, which both have $k$ fragments, this segment is only bounded by its length and can have any number of fragments. Thus, it accounts for added and deleted cut-sites, since these errors do not impact the length of a segment. Figure 2 demonstrates how the skip segment tolerates a deleted cut-site. For example, given $k = 3$, $D = 25$, and $R = [7, 18, 13, 3, 15, 12, 4, 3, 6, 5, 13, 2]$, the bi-labels of $R$ are [7, 18, 13][3, 15, 12][4, 3, 6], [18, 13, 3][15, 12][4, 3, 6] and [13, 3, 15][12, 4, 3, 6][5, 13, 2]. We are now going to define the prefix and suffix bi-labels.

Definition 2 Given integers $D$ and $k$ and a bi-label $b$ with k-mers $b^1 = [b^1_1, \ldots, b^1_k]$ and $b^2 = [b^2_1, \ldots, b^2_k]$ and skip segment $b^s$, we define the prefix bi-label of $b$ as the bi-label with $(k-1)$-mers and skip-segment length at least $D$, where the first $(k-1)$-mer is the $(k-1)$-size prefix of $b^1$, i.e., $[b^1_1, \ldots, b^1_{k-1}]$.

Note that the second $(k-1)$-mer of the prefix bi-label is not necessarily the $(k-1)$-size prefix of $b^2$. We also require an equivalent definition for the suffix of a bi-label.

Definition 3 Given integers $D$ and $k$ and a bi-label $b$ with k-mers $b^1 = [b^1_1, \ldots, b^1_k]$ and $b^2 = [b^2_1, \ldots, b^2_k]$ and skip segment $b^s$, we define the suffix bi-label of $b$ as the bi-label with $(k-1)$-mers and skip-segment length at least $D$, where the first $(k-1)$-mer is the $(k-1)$-size suffix of $b^1$, i.e., $[b^1_2, \ldots, b^1_k]$.

Figure 1 illustrates this concept of prefix and suffix bi-labels. Note that for two successive bi-labels from an Rmap, the prefix bi-label of the latter is the same as the suffix bi-label of the former, as shown in Fig. 1. This is a vital property that allows the de Bruijn graph constructed over bi-labels to be connected.

Fig. 1 All bi-labels for $k = 3$ and $D = 25$ of an Rmap $R$. On each bi-label the fragments from the k-mers and the length of the skip segment are shown in white while the fragments of the skip segment are shown in blue. For each bi-label we show the prefix and suffix bi-labels built with $k = 2$ and $D = 25$

Bi-label proximity

One of the challenges with Rmap data is that the fragments correspond to genomic distances and, due to experimental error, the measured estimates for the same genomic fragment differ across Rmaps representing the same genomic location. For example, $R^1 = [5, 6, 7, 11, 5]$ and $R^2 = [6, 5, 6, 11, 6]$ likely correspond to the same k-mer, but because the values are numeric measurements they are not exactly equal. Thus, we need to define a criterion such that two bi-labels drawn from different Rmaps but corresponding to the same genomic locations can be identified and merged for the construction of the de Bruijn graph. To make the definition of a bi-label robust to sizing errors, we define conditions on both the difference of the individual fragments of two bi-labels and the difference in the total lengths. Hence, we have the following definitions.

Definition 4 Given integers $t_f$, $k$ and $D$, and two bi-labels $a$ and $b$, we let the k-mers of $a$ and $b$ be $a^1 = [a^1_1, \ldots, a^1_k]$ and $a^2 = [a^2_1, \ldots, a^2_k]$, and $b^1 = [b^1_1, \ldots, b^1_k]$ and $b^2 = [b^2_1, \ldots, b^2_k]$, respectively. We define $a$ and $b$ to be fragment proximal if and only if $|a^1_i - b^1_i| \le t_f$ and $|a^2_i - b^2_i| \le t_f$ for all $i = 1, \ldots, k$.

Here $t_f$ is an error-tolerance parameter that handles sizing errors on the fragments of the bi-label.

Definition 5 Given integers $t_\ell$, $k$ and $D$, and two bi-labels $a$ and $b$, we let the k-mers of $a$ and $b$ be $a^1$ and $a^2$, and $b^1$ and $b^2$, respectively, and the skip segments of $a$ and $b$ be $a^s$ and $b^s$, respectively. We define $a$ and $b$ to be length proximal if and only if $|\ell(a^1) - \ell(b^1)| \le t_\ell$, $|\ell(a^2) - \ell(b^2)| \le t_\ell$ and $|\ell(a^s) - \ell(b^s)| \le t_\ell$.

Here $t_\ell$ is another error-tolerance parameter that handles sizing errors on the segment lengths of the bi-label. These two definitions lead to the definition of whether two bi-labels should be treated as equivalent in the de Bruijn graph.

Definition 6 Given integers $k$ and $D$ and two bi-labels $a$ and $b$, we define them to be proximal if and only if they are fragment proximal and length proximal.

This leads to our final definition, which is the set of bi-labels on which the bi-labelled de Bruijn graph is defined.

Definition 7 Given a set of Rmaps $R = \{R_1, \ldots, R_n\}$ and integers $k$ and $D$, let $B$ be the set of bi-labels from $R$. We define the proximal reduced set of bi-labels as the set $B'$, where for each $b$ in $B$ there is a bi-label in $B'$ that it is proximal to.

Definition of the bi-labelled de Bruijn graph

Given the above definitions, we are now ready to define the bi-labelled de Bruijn graph built on a set of proximal bi-labels extracted from Rmaps.

Definition 8 Given integers $k$ and $D$ and a set of Rmaps $R = \{R_1, \ldots, R_n\}$, let $B'$ be the proximal reduced set of bi-labels extracted from $R$. We create a directed edge $e$ for each bi-label $b$ in $B'$ and label the incoming and outgoing nodes of $e$ as the prefix bi-label of $b$ and suffix bi-label of $b$, respectively. After all edges are formed, the graph undergoes a gluing operation. A pair of node bi-labels are glued into a single node if and only if they are proximal. We define the final graph obtained after gluing of nodes as the bi-labelled de Bruijn graph.

Methods

In this section, we describe our method for building and traversing the bi-labelled de Bruijn graph from an Rmap dataset. Our method, which we refer to as rmapper, can be summarized into the following steps: extract and store bi-labels, find proximal bi-labels, build the bi-labelled de Bruijn graph, resolve tips and bubbles, and traverse the graph to build the contigs. We now describe each of these steps in detail.

Fig. 2 Skip segment overcomes missed cut-site. All bi-labels for $k = 3$ and $D = 25$ of two Rmaps $R$ and $R'$, $\{b_1, b_2, b_3\}$ and $\{b'_1, b'_2\}$ respectively. Both Rmaps cover the same genomic location but $R'$ has a missed cut-site in position 5 (shown in red). On each bi-label the fragments from the k-mers and the length of the skip segment are shown in white while the fragments of the skip segment are shown in blue. Despite the missed cut-site on $R'$, bi-labels $b_1$ and $b_2$ are merged to $b'_1$ and $b'_2$ respectively according to our merge function

Extract and store all Bi-labels

We first error correct the Rmap data using cOMet [31] and then extract and store all bi-labels from the error corrected Rmaps. We recall from Definition 6 that two bi-labels are proximal if they are both fragment proximal and length proximal for error-tolerance parameters $t_f$ and $t_\ell$. Therefore, we must store all the bi-labels in a manner that allows finding all proximal bi-labels of a given bi-label efficiently. To accomplish this, we store all the bi-labels in a disjoint set of k-d trees [32] such that each pair of bi-labels in the same k-d tree is length proximal. For each bi-label, the $2k$ fragments of its k-mers are stored in the corresponding k-d tree, which allows for efficiently finding all fragment proximal bi-labels of a given bi-label. Hence, the dimension of each k-d tree is $2k$.

More formally, we identify each k-d tree $K_{a_1,a_2,a_3}$ by three positive integers $a_1$, $a_2$, and $a_3$, and insert a given bi-label $b$ into $K_{a_1,a_2,a_3}$ if the lengths of its two k-mers, $\ell(b^1)$ and $\ell(b^2)$, are within the ranges $[a_1 \times t_\ell, \ldots, (a_1 + 1) \times t_\ell - 1]$ and $[a_2 \times t_\ell, \ldots, (a_2 + 1) \times t_\ell - 1]$ respectively, and the length of the skip segment $\ell(b^s)$ is within the range $[a_3 \times t_\ell, \ldots, (a_3 + 1) \times t_\ell - 1]$. If such a tree does not exist then we create a new one, $K_{a_1,a_2,a_3}$, where $a_1 = \lfloor \ell(b^1)/t_\ell \rfloor$, $a_2 = \lfloor \ell(b^2)/t_\ell \rfloor$ and $a_3 = \lfloor \ell(b^s)/t_\ell \rfloor$.

Next, for each bi-label in our set of k-d trees, we find and store pointers to all proximal bi-labels by performing an orthogonal range query. Given a bi-label $b$ in $K_{a_1,a_2,a_3}$, we let the k-mers of the bi-label $b$ be $b^1 = [b^1_1, \ldots, b^1_k]$ and $b^2 = [b^2_1, \ldots, b^2_k]$. We perform a range query with $([b^1_1 \pm t_f], \ldots, [b^1_k \pm t_f], [b^2_1 \pm t_f], \ldots, [b^2_k \pm t_f])$ in the disjoint set of k-d trees to find all bi-labels whose first k-mer lies within $[b^1_1 \pm t_f], \ldots, [b^1_k \pm t_f]$ and whose second k-mer lies within $[b^2_1 \pm t_f], \ldots, [b^2_k \pm t_f]$. We add a pointer from $b$ to each of these bi-labels. We repeat this for each bi-label. In particular, we perform the range query in all k-d trees where the proximal bi-labels can be found, i.e., all k-d trees $K_{a'_1,a'_2,a_3}$ where, for $m = \min(k t_f, t_\ell)$, we have $\lfloor (\ell(b^1) - m)/t_\ell \rfloor \le a'_1 \le \lfloor (\ell(b^1) + m)/t_\ell \rfloor$ and $\lfloor (\ell(b^2) - m)/t_\ell \rfloor \le a'_2 \le \lfloor (\ell(b^2) + m)/t_\ell \rfloor$.

We note that k-d trees support multi-dimensional orthogonal range-search queries in $O(n^{(2k-1)/2k} + occ)$ time and $O(n)$ space, where $n$ is the number of bi-labels in the tree, $k$ is the k-mer value, and $occ$ is the number of bi-labels that satisfy the constraints of the range-search query.

Graph construction

We first filter all low frequency bi-labels, i.e., bi-labels that have a low number of proximal bi-labels. As illustrated in Fig. 4, bi-labels that have low frequency typically arise from Rmap data that is highly erroneous. After filtering low frequency bi-labels, we build the bi-labelled de Bruijn graph by first building a proximal reduced set from the unfiltered bi-labels, then building all directed edges with labelled nodes from the reduced set, and finally merging nodes that have the same label. Using an efficient heuristic, we first greedily find the proximal reduced set of bi-labels by sorting the unfiltered bi-labels in descending order based on the number of proximal bi-labels found for them. From this sorted list of bi-labels $B$, we iteratively insert bi-labels into the reduced set $B'$ unless the bi-label is proximal to a bi-label already in $B'$.

Next, we build a bi-labelled de Bruijn graph by creating a directed edge for each bi-label $b'$ in $B'$ and labeling the incoming and outgoing nodes as the prefix bi-label and suffix bi-label of $b'$. We store all the nodes and edges in a modified adjacency list format that contains three arrays: one array stores all node bi-labels, one array contains a list of pointers to the incoming nodes for each node, and one array contains a list of pointers to the outgoing nodes for each node. Thus, to insert $b'$ into the graph, we first determine if the prefix and suffix bi-labels are contained in the node array and insert them if they are not, and then insert an entry into the incoming and outgoing arrays with lists containing pointers to the prefix and suffix bi-labels. This graph representation allows the adjacency lists of two nodes to be efficiently merged if the bi-labels they represent are found to be proximal.

Lastly, we merge all nodes in the graph whose bi-labels are proximal to obtain the final bi-labelled de Bruijn graph. For merging the nodes, we again use a set of disjoint k-d trees as we did before for finding proximal bi-labels for the edge bi-labels. Hence, we extract all the node bi-labels and construct a set of k-d trees as before. Then for each node $v$ in the node array, we query the corresponding k-d trees to find all nodes that are proximal to it using the same error tolerance parameters $t_f$ and $t_\ell$. Any node $u$ that is found to be proximal to $v$ is merged into $v$: $u$ is removed from the graph and the two adjacency lists are updated such that the incoming and outgoing array entries storing pointers to $u$ now store pointers to $v$. This can be achieved in linear time. We repeat this until all proximal nodes have been merged. Figure 3 illustrates the construction of the bi-labelled de Bruijn graph for a pair of Rmaps.

Fig. 3 The construction of the bi-labelled de Bruijn Graph. a Two Rmaps $R_1$ and $R_2$ and the bi-labels extracted from them: $\{b_1, b_2, b_3\}$ from $R_1$ and $\{b_4, b_5\}$ from $R_2$ for $k = 3$ and $D = 25$. b Edges $\{e_1, e_2, e_3\}$ depict the proximal reduced set of bi-labels. Bi-labels $\{b_1, b_4\}$ are represented by $e_1$, bi-labels $\{b_2, b_5\}$ are represented by $e_2$ and bi-label $\{b_3\}$ forms $e_3$. We note that in this example no bi-labels are filtered for finding the proximal reduced set. c Nodes introduced into the graph. Each edge breaks into two nodes, one denoted by the prefix bi-label and the other by the suffix bi-label of the edge. A directed edge is drawn from the former to the latter. d The final graph is formed by merging nodes $v_{12}$ with $v_{21}$ and merging $v_{22}$ with $v_{32}$

Fig. 4 Histogram showing the precision of finding proximal bi-labels. For simulated human Rmap data, we found proximal bi-labels for all extracted bi-labels. We designate a proximal bi-label found to be a true positive if its true location in the genome is the same as the location of the bi-label to which it is proximal, and a false positive otherwise. Next, we plotted a histogram showing the distribution of true positive and false positive proximal bi-labels for each bi-label. We show that high frequency bi-labels, i.e., bi-labels for which we find more proximal bi-labels, produce more precise proximal bi-labels. This justifies filtering low frequency bi-labels

Graph cleaning and traversal

Before traversing the graph, we first pre-process the bi-labelled de Bruijn graph to remove tips and bubbles, which are common in de Bruijn graphs. Since they limit the size of unary paths (i.e., paths in the graph that contain nodes with only a single outgoing edge) and do not affect the accuracy of the assembly, it is common practice in short read assembly to resolve or remove these structures [33–36]. Tips are produced when errors cause an otherwise unary path to branch at a node and create a short unary path that ends in a terminal node. Bubbles are created when bi-labels from the same genomic location are not merged and are included in the graph as separate edges. This generates short unary paths that have the same starting node and the same ending node and are close in length.

Similar to existing short read assemblers, we identify all tips and bubbles that have length of at most a specified threshold by performing depth first search starting at each node with out-degree greater than one. Hence, if there exists a tip starting at a given node as well as a path of length longer than the specified threshold, then the tip is removed by deleting all of its edges starting at the branching node. Furthermore, if there exists a bubble starting at a given node, we remove one of the edges adjacent to the branching node. We note that we do not remove an entire path from the graph to resolve a bubble; rather, we only disconnect it at the branching node. Following the work of Simpson et al. [35], we fix the maximum length of the paths in a bubble to twice the size of the bi-label.

After cleaning, our traversal algorithm extracts unitigs (i.e., contigs corresponding to unary paths) from the graph by performing a simple depth first traversal starting from each node with zero incoming edges. We terminate the traversal of a given path if a cycle is reached or a node with out-degree greater than one is reached.

Experiments

In this section, we compare the performance of rmapper, the assembler of Valouev et al. and Bionano Solve. We used the most recent version of Bionano Solve that is publicly available (version 3.5.1). We performed all experiments on Intel E5-2698v3 processors with 192 GB of RAM running 64-bit Linux. Valouev and rmapper were run on error corrected data, which is analogous to assembly of sequence reads. Bionano Solve was not run on error corrected data because its input must be specified in a proprietary format. In addition, for larger genomes, we also ran rmapper by extracting bi-labels from both directions in an Rmap. We refer to this as rmapper2.0.

For all experiments we report the run time (CPU time), peak memory, maximum and mean contig size, genome fraction and number of mis-assembled contigs. We note that genome assembly evaluation tools such as QUAST [37] cannot be used on optical maps; hence, we design our own evaluation setup. To compute the genome fraction, we align all assembled contigs to the optical map reference genome using the alignment method of Valouev et al. [25]. The optical map reference genome is produced by in silico digesting the reference genome using the same restriction enzyme as used for producing the Rmaps. For all contigs that were successfully aligned, we designate their alignment locations on the reference genome as covered and report the percentage of the genome covered by at least one contig as the genome fraction. Any contig that cannot be aligned by the method of Valouev et al. is verified to be mis-assembled by aligning it to the reference genome using a second alignment software, Bionano's RefAligner. The Valouev method aligns an assembled contig to a contiguous stretch of the reference optical map that optimizes its alignment score and does not tolerate mis-assembled regions, whereas RefAligner allows split alignments. Hence, if the alignment output by RefAligner is non-contiguous then it is counted as a mis-assembly.

rmapper takes as input four parameters, namely the size $k$ of the k-mers, the minimum distance $D$ between the two k-mers in the bi-label, and the error tolerance parameters $t_f$ and $t_\ell$. The k-mer size depends on the rate of added and missed cut-sites in the Rmap data. When the frequency of added and missed cut-sites is high, the k-mer size needs to be set low so that a good percentage of k-mers are error-free. We note that the average error-rate of optical-map data typically lies around 17% [30]. Considering that error-correction of the Rmaps is likely to bring the average error-rate below 10% [31], the k-mer size of 6 is the largest value such that the probability that an extracted k-mer will be error-free is at least 50%. Hence we use 6 as the default k-mer size in our experiments. The best combination of coverage, average length of contigs and run-time is achieved by fixing $t_\ell = 2000$. We experimented with the values $D \in \{15000, 20000, 25000, 30000\}$ and $t_f \in \{500, 1000, 1500\}$, and for each experiment we choose the parameter setting that gives the best performance. A higher value of $t_f$ is needed when the Rmap data still has significant sizing errors after error correction. A lower value of $D$ is needed when the average Rmap size is small so that we can extract an adequate number of bi-labels from each Rmap. We show the impact of varying the parameters on the E. coli genome in Section Impact of parameters.

[...] E. coli K-12 substr. MG1655 genome and the human reference genome GRCh38 (NCBI accession number GCF_000001405.26) with OMSim [38]. We used enzyme BspQI, a standard, commonly used restriction enzyme for optical mapping, and used the default error rate of OMSim, which is a 15% rate of deleted cut sites and 1 added cut site per 100 kbp. The resulting E. coli dataset contains 23450 Rmaps with a mean of 42 fragments per Rmap. The human dataset contains 377894 Rmaps with a mean of 61 fragments per Rmap.

Lastly, we performed experiments using the Rmap dataset of the climbing perch (Anabas testudineus) genome generated for the Vertebrate Genomes Project, which consists of 3121480 Rmaps with a mean of 28 fragments. A draft assembly of the genome is provided by the same source and was used to obtain the reference genome optical map.

Impact of parameters

We investigated the impact of parameters on assembly results of E. coli by varying the k-mer size, the parameter $D$ (the minimum length of the skip segment), the parameter $t_f$, and the parameter $t_\ell$.
We considered the following set of values for these parameters: k ={5, 6, 7} , D ={10000, 15000, 20000} , Datasets t ={250, 500, 1000, 1500} , and t ={1500, 2000, 3000} . f l We performed experiments on both simulated and We show the impact of varying k, D and t in Fig.  5. real Bionano data. We simulated data from both E. The detailed statistics of this experiment are found in Fig. 5 Impact of varying parameters k, D, and t on the assembly of E. coli. For all possible combination of these parameters, we calculated and reported the mean contig size. The blue lines depict a k-mer size 5, the red lines depict a k-mer size 6, and the magenta lines depict a k-mer size 7 Mukherjee et al. Algorithms Mol Biol (2021) 16:6 Page 10 of 13 Table 1 Impact of varying the values of t and t on the assembly results for E. coli data f l t t Run time(s) Peak Memory(Mb) No. of contigs Max Mean f l 250 1500 179 305 4 272 (2.420 Mbp) 267 (2.326 Mbp) 250 2000 200 305 3 271 (2.418 Mbp) 270 (2.351 Mbp) 250 3000 232 305 3 271 (2.419 Mbp) 267 (2.333 Mbp) 500 1500 403 459 10 336 (3.072 Mbp) 295 (2.625 Mbp) 500 2000 445 459 23 529 (4.701 Mbp) 371 (3.252 Mbp) 500 3000 509 459 49 529 (4.711 Mbp) 430 (3.793 Mbp) 1,000 1500 476 628 33 531 (4.745 Mbp) 427 (3.792 Mbp) 1,000 2000 533 629 29 529 (4.746 Mbp) 422 (4.746 Mbp) 1,000 3000 629 630 35 530 (4.742 Mbp) 412 (3.662 Mbp) 1,500 1500 537 705 11 424 (3.732 Mbp) 347 (3.028 Mbp) 1,500 2000 616 709 28 533 (4.778 Mbp) 424 (3.760 Mbp) 1,500 3000 748 711 22 535 (4.764 Mbp) 440 (3.887 Mbp) In this Table, the value of k was fixed to 6, and the value of D was fixed to 15,000. The contig with maximum length (Max) is reported in the number of fragments and the total genomic length in mega base pairs (Mbp). Similarly, the mean contig length (Mean) is also reported in the number of fragments and the total genomic length in mega base pairs Additional file  1: Table  S1. 
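The default k-mer size argued for in the Experiments section can be sanity-checked with a short calculation. The sketch below assumes, as a simplification that the text does not state explicitly, that each of the k fragments in a k-mer is hit by an error independently with probability equal to the average error rate, so that a k-mer is error-free with probability (1 − p)^k. The function names are hypothetical and are not part of rmapper.

```python
def error_free_probability(k, error_rate):
    """Probability that all k fragments of a k-mer are error-free.

    Illustrative assumption: errors are independent and occur at a
    fixed per-fragment rate.
    """
    return (1.0 - error_rate) ** k


def largest_k(error_rate, min_probability=0.5):
    """Largest k whose error-free probability is at least min_probability."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    if not 0.0 < min_probability <= 1.0:
        raise ValueError("min_probability must be in (0, 1]")
    k = 0
    # The probability shrinks geometrically, so this loop always terminates.
    while error_free_probability(k + 1, error_rate) >= min_probability:
        k += 1
    return k


if __name__ == "__main__":
    # 17% is the typical raw error rate and 10% the post-correction rate
    # quoted in the text.
    for rate in (0.17, 0.10):
        k = largest_k(rate)
        p = error_free_probability(k, rate)
        print(f"error rate {rate:.2f}: largest k = {k}, P(error-free) = {p:.3f}")
```

Under this assumption a 10% error rate gives k = 6 (0.9^6 ≈ 0.53, whereas 0.9^7 ≈ 0.48), which agrees with the default, while at the uncorrected rate of 17% only k = 3 would clear the 50% threshold.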
For the experiment shown in Fig. 5, t_ℓ was fixed at 2,000. In Table 1, we show the impact of varying t_f and t_ℓ together. For all experiments, contigs longer than 250 fragments are reported. The experiments show that for t_f = 250 the assembly quality is poor. This is justified since the average sizing error exceeds 250. Similarly, for increasing values of D, we see a drop in the quality. This is because larger values of D create fewer bi-labels from an Rmap, which reduces the effective coverage of the data. Among the three k-mer sizes used, the best assembly quality is achieved with k = 6. This is set as our default k-mer value for all experiments.

Performance on E. coli

For the E. coli Rmap dataset, error correction took 2.66 hours of CPU time. The assembly results are summarized in Table 2. For this experiment we extracted bi-labels with k = 6 and D = 15000 and used error tolerance parameter setting t_f = 500 and t_ℓ = 2000. rmapper took 342 seconds and peak memory of 274 Mb to assemble the data. The assembler produced two unitigs longer than 500 fragments, of 529 and 522 fragments in length, both of which covered the reference from start to finish.

The Valouev assembler [24] took 204.8 hours to compute pairwise alignments between all pairs of Rmaps and an additional 30 minutes to assemble them into contigs. It produced 5 contigs, with the longest contig of length 102 fragments (corresponding to a 1 Mbp genomic span). We aligned the assembled contigs back to the reference and found the total genome coverage to be 48%. Bionano Solve produced a high quality assembly, i.e., one contig that spanned 100% of the genome. The assembly took 48.14 hours of CPU time (59.75 minutes of wall time using 60 CPUs in parallel) and peak memory of 1.18 GB. The Valouev aligner reported alignments for all contigs, hence we report zero mis-assembled contigs for all three methods.

In summary, the quality of the Bionano Solve and rmapper assemblies was comparable, yet rmapper was 480 times faster (6 minutes versus 2889 minutes) and used less than 500 Mb of memory.

Table 2 Assembly results for E. coli Rmap data simulated by OMSim using enzyme BspQI

Assembler | Run time | Peak memory | No. of contigs | Max | Mean | GF (%) | MA
Valouev | 8.5 d | 0.48 | 5 | 102 (1.0 Mbp) | 56 (0.5 Mbp) | 48 | 0
Solve | 48.1 h | 1.18 | 1 | 631 (4.9 Mbp) | 631 (4.9 Mbp) | 100 | 0
rmapper | 6 m | 0.46 | 2 | 529 (4.6 Mbp) | 526 (4.5 Mbp) | 100 | 0

The dataset has 23,450 Rmaps of mean size of 42 fragments and coverage of 900x. The peak memory is given in gigabytes (GB). The run time is reported in seconds (s), minutes (m), hours (h) and days (d). rmapper was run with k = 6, D = 15000 and error tolerance parameter setting t_f = 500 and t_ℓ = 2000. The contig with maximum length (Max) is reported in the number of fragments and the total genomic length in mega base pairs (Mbp). Similarly, the mean contig length (Mean) is also reported in the number of fragments and the total genomic length in mega base pairs. The genome fraction (GF) is the percentage of the genome that is covered by at least one contig. Lastly, the number of mis-assembled contigs (MA) is given.

Table 3 Assembly results for human Rmap data simulated by OMSim using enzyme BspQI

Assembler | Run time | Peak memory | No. of contigs | Max | Mean | GF (%) | MA
Valouev | > 360 d | n/a | n/a | n/a | n/a | n/a | n/a
Solve | 122.4 d | 94.8 | 169 | 14,133 (124.6 Mbp) | 2,036 (16.4 Mbp) | 93.8 | 4
rmapper | 12.1 h | 7.9 | 3865 | 1,380 (14.4 Mbp) | 144 (1.4 Mbp) | 95.8 | 0
rmapper 2.0 | 22.2 h | 18.8 | 3524 | 1,752 (18.5 Mbp) | 203 (2.0 Mbp) | 96.7 | 0

The dataset has 377894 Rmaps of mean size of 61 fragments and coverage 80x. See Table 2 for a description of the assembly statistics and notation.
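The speed-up factors quoted in the text can be reproduced from the run times reported in Tables 2 and 3. The following sketch is only arithmetic on the published table values; the helper names are hypothetical and chosen for illustration.

```python
# Conversion factors from the run-time units used in the tables to hours.
HOURS_PER_UNIT = {"s": 1 / 3600, "m": 1 / 60, "h": 1.0, "d": 24.0}


def to_hours(value, unit):
    """Convert a run time given in seconds, minutes, hours or days to hours."""
    return value * HOURS_PER_UNIT[unit]


def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`.

    Both arguments are (value, unit) pairs as listed in the tables.
    """
    return to_hours(*baseline) / to_hours(*candidate)


# Run times copied from Table 2 (E. coli) and Table 3 (human).
ecoli = {"Solve": (48.1, "h"), "rmapper": (6, "m")}
human = {"Solve": (122.4, "d"), "rmapper": (12.1, "h"), "rmapper 2.0": (22.2, "h")}

ecoli_speedup = speedup(ecoli["Solve"], ecoli["rmapper"])
human_speedups = {
    name: speedup(human["Solve"], run_time)
    for name, run_time in human.items()
    if name != "Solve"
}

if __name__ == "__main__":
    print(f"E. coli: rmapper is about {ecoli_speedup:.0f}x faster than Solve")
    for name, factor in human_speedups.items():
        print(f"human: {name} is about {factor:.0f}x faster than Solve")
```

With the table values this gives roughly 481 for E. coli, which the text reports as 480, and roughly 243 and 132 for rmapper and rmapper 2.0 on the human data, in line with the factors of 242 and 132 stated in the text.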
In Table 3, rmapper 2.0 denotes the variant that, as described in the text, extracts bi-labels from Rmaps in both forward and reverse directions.

Performance on human

For the human Rmap dataset, error correction took 1339.31 seconds of wall time running cOMet in parallel on 2000 CPUs (corresponding to 524 hours of CPU time). The assembly results are shown in Table 3. For this experiment we extracted bi-labels with k = 6 and D = 25000 and used error tolerance parameter setting t_f = 1500 and t_ℓ = 2000. rmapper took 12.1 hours and peak memory of 7.9 GB to assemble the data whereas rmapper 2.0 took 22.2 hours and 18.8 GB of peak memory. rmapper produced 3134 contigs whereas rmapper 2.0 produced 2867 contigs. The maximum size unitigs produced by rmapper and rmapper 2.0 were 1380 and 1752 fragments in length, respectively. Lastly, rmapper achieved a net coverage of 95.8% while rmapper 2.0 was able to cover 96.7% of the genome, both with zero mis-assembled contigs.

The Valouev assembler did not produce any output after 360 CPU days, so n/a is reported in Table 3. Bionano Solve produced comparably fewer but longer contigs than rmapper but had 4 mis-assembled contigs. In addition, it took approximately 2937 CPU hours (55 hours of wall time using 60 CPUs in parallel) and peak memory of 94.8 GB. It is also worth noting that Bionano Solve performs an elaborate scaffolding and stitching of contigs, which explains the relatively small number of contigs but higher mis-assembly rate. The scaffolding and stitching cannot be decoupled from the assembly since Bionano only distributes a single executable that runs both. The source code is not publicly available.

In summary, the Valouev assembler did not scale to the human genome, rmapper 2.0 produced slightly longer contigs than rmapper, and Bionano Solve produced the longest contigs but covered 93.8% of the genome and had 4 mis-assembled contigs. In addition, rmapper 2.0 has the highest genome fraction, which is 96.7%. Lastly, rmapper and rmapper 2.0 were 242 and 132 times faster than Solve, respectively, and used 5 times less memory.

Performance on climbing perch

Error correction of the climbing perch (Anabas testudineus) Rmap dataset took 1.84 hours of wall time running cOMet in parallel on 3000 CPUs (corresponding to 2042 hours of CPU time). The assembly results are shown in Table 4. For this experiment we extracted bi-labels with k = 6 and D = 15000 and used error tolerance parameter setting t_f = 1500 and t_ℓ = 2000. rmapper took 7.5 hours and peak memory of 9.7 GB to assemble the data whereas rmapper 2.0 took 14.9 hours and 18.77 GB of peak memory. rmapper produced 4573 contigs whereas rmapper 2.0 produced 4972 contigs. The maximum size unitigs produced by rmapper and rmapper 2.0 were 217 and 294 fragments in length, respectively. Lastly, rmapper achieved a genome fraction of 92.07%, while rmapper 2.0 was able to cover 95.05% of the genome. Both rmapper and rmapper 2.0 produced zero mis-assemblies.

The Valouev assembler did not halt on this dataset after 360 CPU days, so we do not report any results. Solve halted with a fatal error message in its final scaffolding step after 156 CPU days (93 hours of wall time using 60 CPUs in parallel), using a peak memory of 16 GB. We used the latest assembly result produced by the software in order to compare the assembly quality. Similar to the human assembly, Bionano Solve produced comparably fewer and longer contigs than rmapper and had a genome coverage of 97.6%, but had 5 mis-assembled contigs.

In summary, the Valouev assembler did not scale to this genome, rmapper 2.0 produced slightly longer contigs than rmapper, and Bionano Solve produced the longest contigs and covered 97.6% of the draft genome but had 5 mis-assembled contigs. rmapper 2.0 has genome coverage comparable to Solve, at 95.05%, while running 251 times faster.

Table 4 Assembly results for the Rmap dataset of the climbing perch genome

Assembler | Run time | Peak memory | No. of contigs | Max | Mean | GF (%) | MA
Solve | 156 d | 16 | 907 | 1032 (8.4 Mbp) | 104 (7.9 Mbp) | 97.6 | 5
rmapper | 7.5 h | 9.7 | 4573 | 217 (1.6 Mbp) | 32 (0.28 Mbp) | 92.07 | 0
rmapper 2.0 | 14.9 h | 18.8 | 4972 | 294 (2.4 Mbp) | 42 (0.4 Mbp) | 95.05 | 0

The data was generated for the Vertebrate Genomes Project and consists of 3121480 Rmaps with mean size of 28 fragments. The restriction enzyme used in the experiment is BspQI. See Table 2 for a description of the assembly statistics and notation. As described in the text, rmapper 2.0 extracts bi-labels from Rmaps in both forward and reverse directions. Bionano Solve halted with a fatal error message in its final scaffolding step. We used the latest assembly result produced by Solve in order to compare the assembly quality.

Discussion and future work

We implement our approach and show its performance on multiple simulated and real datasets. Our experimental results show that the only non-proprietary method (i.e. by Valouev et al. [24]) is unable to scale to the human and fish genomes, and that our method is at least 130 times faster than Bionano Solve and its memory usage is less than 20% of the memory usage of Bionano Solve. We point out that there is a trade-off between the length of the contigs, the genome fraction, and the number of mis-assemblies. Analogous to assembly of short reads, ideally an assembler should return a small number of contigs or scaffolds which cover the entire genome and have no mis-assembled regions. In the case of the human and fish data, Solve was able to produce fewer and longer scaffolds than rmapper but produced more mis-assemblies than rmapper. Conversely, for the human data, rmapper produced contigs that covered a larger fraction of the genome with no mis-assembled regions. This highlights one trade-off in Rmap assembly. Hence, there is an opportunity to improve Rmap assemblers so that this gap between Solve and rmapper is closed. Another important note about the comparison between the assemblers is that rmapper has a very simple traversal algorithm and does not use any sort of scaffolding. This is due to the fact that the main contribution of this work is formulating and solving the problem of assembly of Rmaps. Bionano Solve has a scaffolding algorithm that cannot be decoupled from the assembly step since only an executable is available. Thus, the results really compare rmapper's unitigs with Solve's scaffolds, and rmapper is still comparable.

This work presents the first non-proprietary Rmap assembler developed in the past decade, and thus opens the door for improving Rmap assembly. There are many related problems and possible improvements that warrant future research. First, the main contribution of our work was adapting the de Bruijn graph to Rmap data. For completeness, we perform depth first search to traverse the bi-labelled de Bruijn graph and extract contigs. Our traversal does not attempt to reconcile complicated regions in the graph; however, we believe that there is a great opportunity to improve the length of the assembled optical maps by devising an algorithm to extend the traversal. Next, we hypothesize that by adapting methods designed for scaffolding and stitching optical mapping data [39, 40], the length of the assembled optical maps can be improved. Lastly, we note that there does not exist a method to evaluate optical map assemblies like there does for genome assemblies, QUAST [37] being the well-known genome assembly evaluation method. Furthermore, although some of the metrics of genome assembly evaluation tools (e.g., mean contig length and length of the longest contig) trivially extend to optical map assemblies, metrics that require sequence alignment to a reference genome (e.g., number of mis-assemblies) do not extend and need redevelopment.

Conclusion

Assembly of Rmap data is a fundamental problem in optical mapping that still remains in a nascent stage, as prior to this work there was only a single other non-proprietary assembler. In this paper, we formulate and describe the first de Bruijn graph approach for Rmap assembly by redefining the de Bruijn graph to adapt it to Rmap data. We accomplish this by extending the definition of a bi-label introduced in the context of the paired-end de Bruijn graph by Medvedev et al. [28]. We refer to our modified de Bruijn graph as the bi-labelled de Bruijn graph and demonstrate how to efficiently build and store it using a two-tiered orthogonal range search data structure.

We implement this approach, leading to a novel Rmap assembler that we call rmapper. We compare the performance of our method with the assembler of Valouev et al. and Bionano Solve on three genomes of varying size: E. coli, human, and climbing perch (a fish species from the Vertebrate Genomes Project). Our comparison demonstrates that rmapper was more than 130 times faster and used five times less memory than Solve, and was more than 2,000 times faster than the assembler of Valouev et al. Consequently, rmapper successfully assembled the 3.1 million Rmaps of the climbing perch genome into contigs that covered over 95% of the draft genome with zero mis-assemblies.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1186/s13015-021-00182-9.

Additional file 1: Table S1. Impact of varying the values of k, D and t_f on the assembly results for E. coli data.

Acknowledgements
Not applicable.

Authors' contributions
KM developed the software and carried out all experiments. All authors contributed towards the design of the algorithm and writing the manuscript. All authors read and approved the final manuscript.

Funding
This work was supported by NSF IIS (Grant No. 1618814) and Academy of Finland (Grants 308030, 335553, and 323233 to LS).

Availability of data and materials
Our software, rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper. Additional data can be accessed from the GitHub repository.

Declarations

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Author details
Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, USA. Department of Computer Science, Helsinki Institute for Information Technology, HIIT, University of Helsinki, Helsinki, Finland.

Received: 19 January 2021 Accepted: 13 April 2021

References

1. Schwartz DC, Li X, Hernandez LI, Ramnarain SP, Huff EJ, Wang Y-K. Ordered restriction maps of saccharomyces cerevisiae chromosomes constructed by optical mapping. Science. 1993;262:110–4.
2. Li L, et al. OMSV enables accurate and comprehensive identification of large structural variations from nanochannel-based single-molecule optical maps. Genome Biol. 2017;18(1):230.
3. Fan X, Xu J, Nakhleh L. Detecting large indels using optical map data. In: RECOMB-CG. LNCS, vol. 11183, pp. 108–127. Springer, 2018.
4. Ganapathy G, et al. De novo high-coverage sequencing and annotated assemblies of the budgerigar genome. GigaScience. 2014;3:11.
5. Chamala S, et al. Assembly and validation of the genome of the non-model basal angiosperm amborella. Science. 2013;342(6165):1516–7.
6. Teague B, et al. High-resolution human genome structure by single-molecule analysis. Proc Natl Acad Sci USA. 2010;107(24):10848–53.
7. Muggli MD, Puglisi SJ, Ronen R, Boucher C. Misassembly detection using paired-end sequence reads and optical mapping data. Bioinformatics. 2015;31(12):80–8.
8. Pan W, Lonardi S. Accurate detection of chimeric contigs via BioNano optical maps. Bioinformatics. 2018;35(10):1760–2.
9. Reslewic S, et al. Whole-genome shotgun optical mapping of Rhodospirillum Rubrum. Appl Environ Microbiol. 2005;71(9):5511–22.
10. Zhou S, et al. A whole-genome shotgun optical map of Yersinia pestis strain KIM. Appl Environ Microbiol. 2002;68(12):6321–31.
11. Zhou S, et al. Shotgun optical mapping of the entire leishmania major Friedlin genome. Mol Biochem Parasitol. 2004;138(1):97–106.
12. Zhou S, et al. Validation of rice genome sequence by optical mapping. BMC Genom. 2007;8(1):278.
13. Zhou S, et al. A single molecule Scaffold for the Maize Genome. PLoS Genet. 2009;5:1000711.
14. Church DM, et al. Lineage-specific biology revealed by a finished genome assembly of the mouse. PLoS Biol. 2009;7(5):1000112.
15. Dong Y, et al. Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat Biotechnol. 2013;31:135.
16. Mukherjee K, Alipanahi B, Kahveci T, Salmela L, Boucher C. Aligning optical maps to de Bruijn graphs. Bioinformatics. 2019;35(18):3250–6.
17. Muggli MD, Puglisi SJ, Boucher C. Efficient indexed alignment of contigs to optical maps; 2014. pp. 68–81.
18. Muggli MD, Puglisi SJ, Boucher C. A Succinct Solution to Rmap Alignment. In: 18th International Workshop on Algorithms in Bioinformatics (WABI 2018), vol. 113; 2018. pp. 12–11216.
19. Muggli MD, Puglisi SJ, Boucher C. Kohdista: an efficient method to index and query possible rmap alignments. Algorithms Mol Biol. 2019;14:25.
20. Leung AK-Y, Kwok T-P, Wan R, Xiao M, Kwok P-Y, et al. Omblast: alignment tool for optical mapping using a seed-and-extend approach. Bioinformatics; 2016. 620.
21. Mendelowitz LM, Schwartz DC, Pop M. Maligner: a fast ordered restriction map aligner. Bioinformatics. 2016;32(7):1016–22.
22. Verzotto D, et al. Optima: Sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis. GigaScience. 2016;5(1):2.
23. Anantharaman TS, Mishra B, Schwartz DC. Genomics via optical mapping iii: Contiging genomic DNA and variations (extended abstract). New York: AAAI Press; 1997. p. 18–27.
24. Valouev A, Schwartz DC, Zhou S, Waterman MS. An algorithm for assembly of ordered restriction maps from single dna molecules. Proc Natl Acad Sci USA. 2006;103(43):15770–5.
25. Valouev A, et al. Alignment of optical maps. J Comp Biol. 2006;13(2):442–62.
26. Idury RM, Waterman MS. A new algorithm for DNA sequence assembly. J Comput Biol. 1995;2(2):291–306.
27. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001;98(17):9748–53.
28. Medvedev P, Pham S, Chaisson M, Tesler G, Pevzner P. Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. J Comput Biol. 2011;18:1.
29. Li M, et al. Towards a more accurate error model for BioNano optical maps. In: ISBRA 2016; 2016. pp. 67–79.
30. Chen P, Jing X, Ren J, Cao H, Hao P, Li X. Modelling BioNano optical data and simulation study of genome map assembly. Bioinformatics. 2018;34(23):3966–74.
31. Mukherjee K, Washimkar D, Muggli MD, Salmela L, Boucher C. Error correcting optical mapping data. GigaScience. 2018;7:1.
32. Bentley JL. Multidimensional binary search trees used for associative searching. Commun ACM. 1975;18(9):509–17.
33. Bankevich A, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77.
34. Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9.
35. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23.
36. Peng Y, Leung HC, Yiu S-M, Chin FY. IDBA-UD: A de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8.
37. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5.
38. Miclotte G, Plaisance S, Rombauts S, Van de Peer Y, Audenaert P, et al. OMSim: a simulator for optical map data. Bioinformatics. 2017;1:2740–2.
39. Pan W, Jiang T, Lonardi S. OMGS: optical map-based genome scaffolding. J Comput Biol. 2020;27(4):519–33.
40. Shelton JM, Coleman MC, Herndon N, Lu N, Lam ET, Anantharaman T, Sheth P, Brown SJ. Tools and pipelines for BioNano data: molecule assembly pipeline and fasta super scaffolding tool. BMC Genomics. 2015;16(1):734.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Algorithms for Molecular Biology, Springer Journals

Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph


Publisher: Springer Journals
Copyright: © The Author(s) 2021
eISSN: 1748-7188
DOI: 10.1186/s13015-021-00182-9

Abstract

Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) only successfully ran on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper.

Keywords: Optical mapping, Single molecule maps, de Bruijn graph, Overlap-layout-consensus, Genome assembly, Mis-assemblies

*Correspondence: kingdgp@ufl.edu
Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, USA
Full list of author information is available at the end of the article

© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Introduction

In 1993 Schwartz et al. developed optical mapping [1], a system for creating an ordered, genome wide high resolution restriction map of a given organism's genome. Since this initial development, genome wide optical maps have found numerous applications including discovering structural variations [2, 3], scaffolding and validating contigs for several large sequencing projects [4, 5], and detecting mis-assembled regions in draft genomes [6–8]. Thus, optical mapping has assisted in the assembly of a variety of species, including various prokaryotic species [9–11], rice [12], maize [13], mouse [14], goat [15], parrot [4], and amborella trichopoda [5]. Bionano Genomics has enabled the automated generation of the data, enabling the data to become more wide-spread. For example, Bionano data was generated for 133 species sequenced for the Vertebrate Genomes Project.

Similar to sequencing, the protocol for producing optical mapping data begins with many fragmented copies of the genome of interest. This redundancy allows overlap between the raw data and assembly into longer contiguous regions corresponding to the genome. With a selected enzyme, the genomic DNA fragments are nicked at each restriction site recognized by the enzyme. These cleaved fragments are then photographed and analyzed in order to determine the length (in kbp) of the regions between nick sites. The results of this process are optical maps for all the fragments, which are referred to as Rmaps. For example, given a genome fragment TTTTAACTGGGGGGGAACTTTTTTTTAACTTTTT and an enzyme that recognizes the site AACT and cleaves in the middle, the resulting Rmap would be [6, 11, 11, 6]. Rmaps by themselves are not traditionally used for analysis (although they can be [2, 3, 16]) and instead have to be assembled into longer contiguous optical maps corresponding to the genome. Hence, assembly of Rmaps refers to the problem of generating a consensus genome wide optical map from overlapping Rmaps.

Although optical mapping has been around for several decades, the problem of efficiently assembling the data largely remains open as there has been little work in this area, which is largely due to the challenges posed by the data itself. We should note that several related problems, such as alignment of optical mapping data [16–22], have been more thoroughly explored. Rmap data has a number of errors that make it difficult to assemble: namely, there exist added and deleted cut sites and sizing error, resulting in extra fragments, merges of neighboring fragments and under- or over-estimates of the length of a fragment. In the running example, the error free Rmap of [6, 11, 11, 6] could occur as [6, 22, 6] with error. Nonetheless, there exist two Rmap assembly methods: Gentig by Anantharaman et al. [23] and the assembler of Valouev et al. [24]. Developed in 1998, Gentig is the first Rmap assembly algorithm. It is based on a Bayesian model that seeks to maximize the a posteriori estimate of the consensus optical map produced by the assembly of Rmaps. It first computes the overlap between all pairs of Rmaps using dynamic programming, and then builds contigs by greedily merging the Rmaps based on alignment score. This process of merging contigs continues until all alignments above a certain score are merged. Valouev et al. [24] implemented an overlap-layout-consensus (OLC) assembly algorithm using their alignment algorithm [25], which also starts by calculating alignments between all pairs of Rmaps and identifying all alignments that have a score above a specified threshold. A graph is built, where Rmaps are represented as nodes, and the non-filtered alignments are represented as edges. The graph is refined by eliminating paths in the graph that are weakly supported. [...] by traversing this graph in a depth first manner. Bionano Genomics Inc. provides a proprietary assembly method, called Bionano Solve; however, the source code is not publicly available and the algorithmic details are unknown due to the proprietary nature of the software.

The alternative to an OLC approach for assembly is a de Bruijn graph approach that relies on building and traversing a de Bruijn graph constructed on the sequence data. For simplicity, we give a constructive definition of the de Bruijn graph in the context of genome assembly. Given a set of sequences R = {r_1, ..., r_n} and an integer k, the de Bruijn graph is constructed by creating a directed edge for each unique k length substring (k-mer) with the nodes labeled as the k − 1 length prefix and k − 1 length suffix of the k-mer, and then all nodes that have the same label are merged. The important aspect of the de Bruijn graph assembly approach is that it avoids having to find alignments between any pair of sequences, leading to an O(n) run-time. Since its introduction by Idury et al. [26] and Pevzner et al. [27], this approach has become the most common paradigm for assembling short read sequencing data because it led to huge gains in performance over OLC approaches. Hence, applying a de Bruijn graph approach to Rmap assembly would likely lead to similar improvements by removing the burden of finding all pairwise alignments between Rmaps. This assembly works on the premise that a k-mer will frequently occur exactly, without error, in the data. Hence, the biggest challenge we face is constructing a de Bruijn graph with added and deleted cut-sites and sizing error. Even without the occurrence of added and deleted cut-sites, k-mers created from Rmap data are unlikely to be exact replicas due to sizing error. For example, [6, 11, 11, 6] and [5, 10, 11, 7] should likely be recognized as instances of the same k-mers in Rmap data. Thus, to overcome this challenge the de Bruijn graph has to be redefined to account for the inexactness of the data.

In this paper, we formulate and describe a de Bruijn graph approach for de novo Rmap assembly, which heavily relies on redefining the de Bruijn graph to make it suitable for Rmap data. We accomplish this by extending the definition of a bi-label in the context of the paired de Bruijn graph that was introduced by Medvedev et al. [28]. We refer to our modified de Bruijn graph as the bi-labelled de Bruijn graph. Next, we demonstrate how to efficiently build and store the de Bruijn graph using a two tier orthogonal-range search data structure. We imple-
In other words, if two connected regions ment this approach, leading to a novel Rmap assembler in the graph are joined by only a single path—or with that we call rmapper . We compare the performance of multiple paths, but having one or more common inter- our method with the assembler of Valouev et  al., and mediate nodes—then the graph is disconnected at these Bionano Solve on three genomes of varying size: E. coli, nodes. Further, an edge is removed if it is inconsistent human, climbing perch (a fish species from the Verte - with a higher scoring edge. Contigs are then generated brate Genomes Project). Our comparison demonstrates M ukherjee et al. Algorithms Mol Biol (2021) 16:6 Page 3 of 13 that rmapper was more than 130 times faster and used et al. and Chen et al. predict a fixed probability for diges - less than five times less memory than Solve, and was tion of a cut-site while Li et al. model the probability more than 2,000 times faster than Valouev et  al. Also, of digestion as a function of lengths of the fragments rmapper successfully assembled the 3.1 million Rmaps flanking the cut-site. The likelihood of a missed cut-site of the climbing perch genome into contigs that covered decreases with the length of the fragment. All three mod- over 95% of the draft genome with zero mis-assemblies. els postulate additional or false cut-sites result from ran- dom breaks of the DNA molecule and hence model the Background and definitions number of false cuts per unit length of DNA as a Poisson Rmap data and genome wide optical maps distribution. Li et al. observed that false cuts occurred From a computer science perspective, we can view an less frequently at the two ends of an Rmap. Rmap R =[r , r , . . . , r ] as an ordered list of integers. 1 2 |R| Each number represents the length of the respective frag- Rmap segments and k‑mers ment. 
The size of an Rmap R denotes the number of frag- We define a segment s of an Rmap starting at position p,q ments in R, which we denote as |R|. For example, say we p and ending at position q, as the q − p + 1 consecutive have an enzyme that cleaves the DNA at the middle posi- fragments starting from r , i.e., [r , r , .., r ] . We define p p p+1 q tion of AACT and a genomic sequence TTT TAA CTG GGG the length of a segment as the summation of all of its GGG AAC TTT TTT TTA ACT TTTT , then the Rmap will be constituent fragments, i.e., r + ··· + r . We denote the p q R =[6, 11, 11, 6] corresponding to the cleaved sequences length of a segment s as ℓ(s ) . We note that the length p,q p,q [TTT TAA , CTG GGG GGGAA, CTT TTT TTTAA, CTT of the Rmap R should not be confused with the number TTT ]. of fragments, which we denote as its size |R|. In this paper, we extend the definition of a k-mer to the Error profile of Rmap data context of Rmap data as follows. Given an integer k, we There are three types of errors that can occur in optical define a k-mer as a segment of exactly k fragments, i.e., a mapping: (1) missing cut sites which are caused by an sequence of k successive fragments of an Rmap. Follow- enzyme not cleaving at a specific site, (2) additional cut ing the example from above, the following two 3-mers sites which can occur due to random DNA breakage and exist in R =[6, 11, 11, 6] : [6, 11, 11] and [11, 11, 6]. (3) inaccuracy in the fragment size due to the inability of the system to accurately estimate the fragment size. Prefixes and suffixes of Rmaps Continuing again with the example above, an example Given an Rmap R =[r , r , . . . , r ] , we define the x-size 1 2 |R| of an additional cut site would be when the second frag- prefix of R as R =[r , r , . . . , r ] , where x is at most 1 2 x ment of R is split into two, e.g., R =[6, 5, 6, 11, 6] , and |R|− 1 . 
Conversely, we define the x-size suffix of R as an example of a missing cut site would be when the last R =[r , . . . , r ] , where x is at most |R|− 1. |R|−x+1 |R| two fragments of R are joined into a single fragment, e.g., R =[6, 11, 17] . Lastly, an example of a sizing error would The Bi‑labelled de Bruijn graph be if the size of the first fragment is estimated to be 7 In this section, we modify the traditional definition of the rather than 6. de Bruijn graph for Rmap data by first redefining the con - Several different probabilistic models have been pro - cept of a bi-label for Rmap data. The term bi-label was posed for describing the sizing error, and the frequency first introduced by Medvedev et al. [28] in the context of of added and missed cut-sites, including the models of short read assembly to incorporate mate-pair data into Valouev et al. [25], Li et al. [29], and Chen et al. [30]. We assembly of paired-end reads. There the term bi-label briefly describe these models here but refer to the origi - refers to two k-mers separated by a specified genomic nal papers for a full description. Both Valouev et  al. and distance. The redefinition of the de Bruijn graph with Chen et  al. describe the observed fragment lengths as this extra information was shown to de-tangle the result- normal distribution with the mean being equal to the ing graph, making traversal more efficient and accurate. true length of the fragment and the standard deviation Here, we demonstrate that an equivalent paradigm can being a function of the true length, i.e. longer fragments be effective for Rmap assembly. exhibit larger standard deviation. In the model by Li et al. 
the sizing error uses a Laplace distribution as follows: if Bi‑labels the observed and actual size of a fragment are o and r , Given integers k and D, and Rmap R, we define a bi-label i i respectively, then the sizing error, o ∼ r × Laplace(µ , β) from an Rmap R, as a segment of R containing a pair of i i where µ and β are parameters of the Laplace distribution k-mers separated by the shortest segment that has a and are functions of r . All studies model the probability length of at least D. The following is a formal definition. of having a missed cut-site as a Bernoulli trial. Valouev Mukherjee et al. Algorithms Mol Biol (2021) 16:6 Page 4 of 13 Bi‑label proximity Definition 1 Given an Rmap One of the challenges with Rmap data is the fact that the R =[r , r , ..., r , r , .., r ] , integers k and D, and a 1 2 i i+1 |R| fragments correspond to genomic distances and due to position i, we define the bi-label at position i to be 1 2 experimental error, the measured estimates for the same [s , r , . . . , r , s ] , where p = i + k and q is an index such p q k k genomic fragment are different across different Rmaps that ℓ(s )< D ≤ ℓ(s ) p,q−1 p,q representing the same genomic location. For example, 1 2 R =[5, 6, 7, 11, 5] and R =[6, 5, 6, 11, 6] likely correspond and s and s are the k-mers starting at positions i and k k to the same k-mer but the numerical nature makes it such q + 1 , resp e ctively . that they are not exactly equal. Thus, we need to define a 1 2 criteria such that two bi-labels drawn from different Rmaps Next, we refer to segment s between s and s as the p,q k k 1 2 but corresponding to the same genomic locations can be skip segment, and note that, unlike s and s which both k k identified and merged for the construction of the de Bruijn have k fragments, this segment is only bounded by its graph. Thus, to make the definition of a bi-label robust to length and can have any number of fragments. 
u Th s, this sizing errors, we define conditions on both the difference accounts for added and deleted cut-sites since these of the individuals fragments of two bi-labels and the dif- errors do not impact the length of a segment. Figure  2 ference in the total lengths. Hence, we have the following demonstrates how the skip-segment tolerates a deleted definitions. cut-site. For example, given k = 3 , D = 25 , and R =[7, 18, 13, 3, 15, 12, 4, 3, 6, 5, 13, 2] , the bi-labels of R Definition 4 Given integers t , k and D, and two are [7, 18, 13] [3, 15, 12] [4, 3, 6] , f bi-labels a and b, we let the k-mers of a and b be [18, 13, 3][15, 12][4, 3, 6] and [13, 3, 15][12, 4, 3, 6] 1 1 1 2 2 2 1 1 1 a =[a , .., a ] and a =[a , .., a ] and b =[b , .., b ] 1 k 1 k 1 k 2 2 2 [5, 13, 2] . We are now going to define the prefix and suf - and b =[b , .., b ] , respectively. We define a and b to 1 k 1 1 be fragment proximal if and only if |a − b |≤ t and fix bi-labels. f i i 2 2 |a − b |≤ t for all i = 1, .., k. i i Definition 2 Given integers D and k and bi-label b with 1 1 1 2 2 2 Here t is an error-tolerance parameter that handles siz- k-mers b =[b , ..b ] and b =[b , .., b ] and skip seg- 1 k 1 k ing errors on the fragments of the bi-label. ment b , we define the prefix bi-label of b as the bi-label with (k − 1)-mers and skip-segment length at least D, Definition 5 Given integers t , k and D, and two bi- where the first (k − 1)-mer is the (k − 1)-size prefix of b 1 1 labels a and b, we let the k-mers of a and b be a and i.e. [b , ..b ]. 1 k−1 2 1 2 a and b and b , respectively, and the skip segment of s s a and b be a and b , respectively. We define a and b to Note that the second (k − 1)-mer of the prefix bi-label 1 1 be length proximal if and only if |ℓ(a ) − ℓ(b )|≤ t , is not necessarily the (k − 1)-size prefix of b . We also 2 2 s s |ℓ(a ) − ℓ(b )|≤ t and |ℓ(a ) − ℓ(b )|≤ t . ℓ ℓ require an equivalent definition for the suffix of a bi-label. 
Here t is another error-tolerance parameter that handles Definition 3 Given integers D and k and bi-label b with 1 1 1 2 2 2 sizing errors on the segment lengths of the bi-label. These k-mers b =[b , ..b ] and b =[b , .., b ] and skip seg- 1 k 1 k two definitions lead to our final definition that defines ment b , we define the suffix bi-label of b as the bi-label whether two bi-labels should be defined as equivalent in with (k − 1)-mers and skip-segment length at least D, the de Bruijn graph. where the first (k − 1)-mer is the (k − 1)-size suffix of b 1 1 i.e. [b , ..b ]. 2 k Definition 6 Given integers k and D and two bi-labels a and b, we define them to be proximal if and only if they Figure  1 illustrates this concept of prefix and suffix are fragment proximal and length proximal. bi-labels. Note that for two successive bi-labels from an Rmap, the prefix bi-label of the latter is the same as the This leads to our final definition, which is the set of bi- suffix bi-label of the former as shown in Fig.  1. This is a labels in which the bi-labelled de Bruijn graph is defined vital property that allows the de Bruijn graph constructed on. over bi-labels to be connected. Definition 7 Given a set of Rmaps {R , .., R } and inte- 1 n gers k and D, let B be the set of bi-labels from R. We define the proximal reduced set of bi-labels as the set M ukherjee et al. Algorithms Mol Biol (2021) 16:6 Page 5 of 13 Fig. 1 All bi-labels for k = 3 and D = 25 of an Rmap R. On each bi-label the fragments from the k-mers and the length of the skip segment are shown in white while the fragments of the skip segment are shown in blue. For each bi-label we show the prefix and suffix bi-labels built with k = 2 and D = 25 ′ ′ gluing operation. A pair of node bi-labels are glued into B , where for each b in B there is a bi-label in B that it is a single node if and only if they are proximal. We define proximal to. 
the final graph obtained after gluing of nodes as the bi- labelled de Bruijn graph. Definition of the bi‑labelled de Bruijn graph Given the above definitions, we are now ready to define Methods the bi-labelled de Bruijn graph built on a set of proxi- In this section, we describe our method for building and mal bi-labels extracted from Rmaps. traversing the bi-labelled de Bruijn graph from an Rmap dataset. Our method, which we refer to as rmapper , can Definition 8 Given integers k and D and set of Rmaps be summarized into the following steps: extract and store {R , .., R } , let B be the proximal reduced set of bi-labels 1 n bi-labels, find proximal bi-labels, build the bi-labelled de extracted from R. We create a directed edge e for each bi- Bruijn graph, resolve tips and bubbles, and traverse the label b in B and label the incoming and outgoing nodes of graph to build the contigs. We now describe each of these e as the prefix bi-label of b and suffix bi-label of b, respec - steps in detail. tively. After all edges are formed, the graph undergoes a Mukherjee et al. Algorithms Mol Biol (2021) 16:6 Page 6 of 13 ′ ′ ′ Fig. 2 Skip segment overcomes missed cut-site. All bi-labels for k = 3 and D = 25 of two Rmaps R and R , {b , b , b } and {b , b } respectively. Both 1 2 3 1 2 Rmaps cover the same genomic location but R has a missed cut-site in position 5 (shown in red). On each bi-label the fragments from the k-mers and the length of the skip segment are shown in white while the fragments of the skip segment are shown in blue. Despite the missed cut-site on ′ ′ ′ R bi-labels b and b are merged to b and b respectively according to our merge function 1 2 1 2 2 2 2 Extract and store all Bi‑labels and b =[b , .., b ] . We perform a range query with 1 k 1 1 2 2 We first error correct the Rmap data using cOMet [31] ([b ± t ], . . . , [b ± t ], [b ± t ], . . . 
, [b ± t ]) in the f f f f 1 k 1 k and then extract and store all bi-labels from the error disjoint set of k-d trees to find all bi-labels whose first k- 1 1 corrected Rmaps. We recall from Definition 6 that two mer is equal to [b ± t ], . . . , [b ± t ] and whose second f f 1 k 2 2 bi-labels are proximal if they are both fragment proximal k-mer is equal to [b ± t ], . . . , [b ± t ] . We add a pointer f f 1 k as well as length proximal for error-tolerance param- from b to each of these bi-labels. We repeat this for each eters t and t . Therefore, we must store all the bi-labels f ℓ bi-label. In particular, we perform the range query in in a manner that allows finding all proximal bi-labels of all k-d trees where the proximal bi-labels can be found, a given bi-label efficiently. To accomplish this, we store ′ ′ i.e., all k-d trees K where for m = min(kt , t ) we f ℓ a ,a ,a 1 2 1 ′ 1 all the bi-labels in a disjoint set of k-d trees [32] such that have, ⌊(ℓ(b ) − m)/t ⌋≤ a ≤⌊(ℓ(b ) + m)/t ⌋ and ℓ ℓ 2 ′ 2 each pair of bi-labels in the same k-d tree is length proxi- ⌊(ℓ(b ) − m)/t ⌋≤ a ≤⌊(ℓ(b ) + m)/t ⌋. ℓ ℓ mal. For each bi-label, the 2k fragments of the k-mers of it We note that k-d trees support multi-dimensional (2k−1)/2k are stored in the corresponding k-d tree, which will allow orthogonal range-search queries in O(n + occ) for efficiently finding all fragment proximal bi-labels of a time and O(n) space where n is the number of bi-labels given bi-label. Hence, the dimension of each k-d tree is in the tree, k is the k-mer value, and occ is the number of 2k. bi-labels that satisfy the constraints of the range-search More formally, we identify each k-d tree K by a ,a ,a 1 2 3 query. three positive integers a , a , and a , and insert a given bi- 1 2 3 label b into K if the length of its two k-mers ℓ(b ) and a ,a ,a 1 2 3 ℓ(b ) are within the range [a × t , . . . , (a + 1) × t − 1] 1 ℓ 1 ℓ Graph construction and [a × t , . . . 
, (a + 1) × t − 1] respectively and the 2 ℓ 2 ℓ We first filter all low frequency bi-labels, i.e., bi-labels length of the skip segment ℓ(b ) is also within the range that have a low number of proximal bi-labels. As illus- [a × t , . . . , (a + 1) × t − 1] . If such a tree does not 3 ℓ 3 ℓ trated in Fig.  4, bi-labels that have low frequency typi- exist then we create a new one with K , where a ,a ,a 1 2 3 cally arise from Rmap data that is highly erroneous. After 1 2 s a =⌊ℓ(b )/t ⌋ , a =⌊ℓ(b )/t ⌋ and a =⌊ℓ(b )/t ⌋. 1 ℓ 2 ℓ 3 ℓ filtering low frequency bi-labels, we build the bi-labelled Next, for each bi-label in our set of k-d trees, we find de Bruijn graph by first building a proximal reduced set and store pointers to all proximal bi-labels by performing from the unfiltered bi-labels, then building all directed an orthogonal range query. Given a bi-label b in K , a ,a ,a 1 2 3 edges with labelled nodes from the reduced set, and 1 1 1 we let the k-mers of the bi-label b be b =[b , .., b ] 1 finally merging nodes that have the same label. Using k M ukherjee et al. Algorithms Mol Biol (2021) 16:6 Page 7 of 13 an efficient heuristic, we first greedily find the proximal adjacency lists of two nodes to be efficiently merged if the reduced set of bi-labels by sorting the unfiltered bi-labels bi-labels they represent are found to be proximal. in descending order based on the number of proximal bi- Lastly, we merge all nodes in the graph whose bi- labels found for them. From this sorted list of bi-labels labels are proximal to obtain the final bi-labelled de B, we iteratively insert bi-labels into the reduced set B Bruijn graph. For merging the nodes, we again use a set unless the bi-label is proximal to a bi-label already in B . of disjoint k-d trees as we did before for finding proxi - Next, we build a bi-labelled de Bruijn graph by creating mal bi-labels for the edge bi-labels. 
Hence, we extract ′ ′ a directed edge for each bi-label b in B and labeling the all the node bi-labels and construct a set of k-d trees as incoming and outgoing nodes as the prefix bi-label and suf - before. Then for each node v in the node array, we query x bi-l fi abel of b . We store all the nodes and edges in a mod- the corresponding k-d trees to find all nodes that are ified adjacency list format that contains three arrays: one proximal to it using the same error tolerance param- array stores all node bi-labels, one array containing a list of eters t and t . Any node u that is found to be proximal f ℓ pointers of the incoming nodes for each node, and lastly, to v is merged to v by removing u from the graph by one array containing a list of pointers of the outgoing nodes updating the two adjacency lists such that the incom- for each node. u Th s, to insert b into the graph, we first ing and outgoing array entries storing pointers to u are determine if the prefix and suffix bi-labels are contained in updated to store pointers to v. This can be achieved the node array and insert them if they are not contained in in linear time. We repeat this until all proximal nodes the list, and then insert an entry into the incoming and out- have been merged. Figure 3 illustrates the construction going arrays with lists containing pointers to the prefix and of the bi-labelled de Bruijn graph for a pair of Rmaps. suffix bi-labels. This graph representation will allow for the Fig. 3 The construction of the bi-labelled de Bruijn Graph. a Two Rmaps R and R and the bi-labels extracted from them—{b , b , b } from R and 1 2 1 2 3 1 {b , b } from R for k = 3 and D = 25 . b Edges {e , e , e } depict the proximal reduced set of bi-labels. Bi-labels {b , b } are represented by e , bi-labels 3 4 2 1 2 3 1 4 1 {b , b } are represented by e and bi-label {b } forms e . We note that in this example no bi-labels are filtered for finding the proximal reduced set. 
2 5 2 3 3 c Nodes introduced into the graph. Each edge breaks into two nodes—one denoted by the prefix bi-label and the other by suffix bi-label of the edge. A directed edge is drawn from the former to the latter. d The final graph is formed by merging nodes v with v and merging v with v 12 21 22 32 Mukherjee et al. Algorithms Mol Biol (2021) 16:6 Page 8 of 13 an entire path from the graph to resolve a bubble—rather, we only disconnect them at the branching node. Follow- ing the work of Simpson et al. [35], we fix the maximum length of the paths in a bubble to twice the size of the bi-label. After cleaning, our traversal algorithm extracts unit- igs (i.e. contigs corresponding to unary paths) from the graph by performing a simple depth first traversal start - ing from each node with zero incoming edges. We termi- nate the traversal of a given path if a cycle is reached or a node with out-degree greater than one is reached. Experiments In this section, we compare the performance of rmapper , the assembler of Valouev et  al. and Bionano Solve. We used the most recent version of Bionano Solve that is publicly available (version 3.5.1.). We performed all Fig. 4 Histogram showing the precision of finding proximal bi-labels. experiments on Intel E5-2698v3 processors with 192 GB For simulated human Rmap data, we found proximal bi-labels for of RAM running 64-bit Linux. Valouev and rmapper were all extracted bi-labels. We designate a proximal bi-label found to be a true positive if its true location in the genome is the same as the ran on error corrected data, which is analogous to assem- location of the bi-label to which it is proximal—and false positive bly of sequence reads. Bionano Solve was not because the otherwise. Next, we plotted a histogram showing the distribution of input is required to be specified in their proprietary for - true positives and false positive proximal bi-labels for each bi-label. mat. 
In addition, for larger genomes, we also ran rmapper We show that high frequency bi-labels i.e. bi-labels for which we find by extracting bi-labels from both directions in an Rmap. more proximal bi-labels produce more precise proximal bi-labels. This justifies filtering low frequency bi-labels We refer to this as rmapper2.0. For all experiments we report the run time (CPU time), peak memory, maximum and mean contig size, genome Graph cleaning and traversal fraction and number of mis-assembled contigs. We note Before traversing the graph, we first pre-process the bi- that genome assembly evaluation tools such as QUAST labelled de Bruijn graph to remove tips and bubbles, [37] cannot be used on optical maps—hence, we design which are common in de Bruijn graphs. Since they limit our own evaluation setup. To compute the genome the size of unary paths (i.e. paths in the graph that con- fraction, we align all assembled contigs to the optical tain nodes with only a single outgoing edge) and do not map reference genome using the alignment method of affect the accuracy of the assembly, it is common prac - Valouev et  al. [25]. The optical map reference genome tice in short read assembly to resolve or remove these is produced by in silico digesting the reference genome structures [33–36]. Tips are produced when errors cause using the same restriction enzyme as used for producing an otherwise unary path to branch at a node and create the Rmaps. For all contigs that were successfully aligned, a short unary path that ends in a terminal node. Bubbles we designate their alignment locations on the reference are created when bi-labels from the same genomic loca- genome as covered and report the percentage of the tion are not merged and included in the graph as sepa- genome covered by at least one contig as the genome rate edges. This generates short unary paths that have the fraction. 
Any contig which is unable to be aligned by same starting node and the same ending node and are Valouev et  al. is verified to be mis-assembled by align - close in length. ing it to the reference genome using a second alignment Similar to existing short read assemblers, we identify software—Bionano’s RefAligner. The Valouev method all tips and bubbles that have length of at most a speci- aligns an assembled contig to a contiguous stretch of the fied threshold by performing depth first search starting reference optical map that optimizes its alignment score at each node with out-degree greater than one. Hence, and does not tolerate mis-assembled regions, whereas if there exists a tip starting at a given node as well as a RefAligner allows split alignments. Hence, if the align- path of length longer than the specified threshold, then ment outputted from RefAligner is uncontiguous then it the tip is removed by deleting all of its edges starting at is counted as a mis-assembly. the branching node. Furthermore, if there exists a bubble rmapper takes as input four parameters, namely the starting at a given node, we remove one of the edges adja- size k of the k-mers, the minimum distance D between cent to the branching node. We note we do not remove the two k-mers in the bi-label, and the error tolerance M ukherjee et al. Algorithms Mol Biol (2021) 16:6 Page 9 of 13 parameter setting t and t . The k-mer size depends on coli K-12 substr. MG1655 genome and the human ref- f ℓ the rate of added and missed cut-sites in the Rmap data. erence genome GRCh38 (NCBI accession number When the frequency of added and missed cut-sites is GCF_000001405.26) with OMSim [38]. We used enzyme high, the k-mer size needs to be set low so that a good BspQI — a standard, commonly used restriction enzyme percentage of k-mers are error-free. 
We note that the for optical mapping — and used the default error rate of average error-rate of optical-map data typically lies OMSim, which is a 15% rate of deleted cut sites, and 1 around 17% [30]. Considering that error-correction of added cut site per 100kbp. The resulting E. coli dataset the Rmaps is likely to bring the average error-rate below contains 23450 Rmaps with a mean of 42 fragments per 10% [31], the k-mer size of 6 is the largest value such that Rmap. The human dataset contains 377894 Rmaps with a the probability that an extracted k-mer will be error-free mean of 61 fragments per Rmap. is at least 50% . Hence we use 6 as the default k-mer size in Lastly, we performed experiments using the Rmap our experiments. The best combination of coverage, aver - dataset of the climbing perch (Anabas testudineus) age length of contigs and run-time is achieved by fixing genome generated for the Vertebrate Genomes Project, t = 2000 . We experimented with the following values which consists of 3121480 Rmaps with mean of 28 frag- of D ={15000, 20000, 25000, 30000} and the following ments. A draft assembly of the genome is provided from values of t ={500, 1000, 1500} and for each experiment, the same source which was used to obtain the reference we choose the parameter setting that gives the best per- genome optical map. formance. A higher value of t is needed when the Rmap data still has significant sizing errors after error cor - Impact of parameters rection. A lower value of D is needed when the average We investigated the impact of parameters on assem- Rmap size is small so that we can extract an adequate bly results of E. coli by varying the k-mer size, the number of bi-labels from each Rmap. We show the parameter D (which denotes the length of the skip impact of varying the parameters on the E. coli genome segment, the parameter t , and the parameter t . f l in Section Impact of parameters. 
We considered the following set of values for these parameters: k ={5, 6, 7} , D ={10000, 15000, 20000} , Datasets t ={250, 500, 1000, 1500} , and t ={1500, 2000, 3000} . f l We performed experiments on both simulated and We show the impact of varying k, D and t in Fig.  5. real Bionano data. We simulated data from both E. The detailed statistics of this experiment are found in Fig. 5 Impact of varying parameters k, D, and t on the assembly of E. coli. For all possible combination of these parameters, we calculated and reported the mean contig size. The blue lines depict a k-mer size 5, the red lines depict a k-mer size 6, and the magenta lines depict a k-mer size 7 Mukherjee et al. Algorithms Mol Biol (2021) 16:6 Page 10 of 13 Table 1 Impact of varying the values of t and t on the assembly results for E. coli data f l t t Run time(s) Peak Memory(Mb) No. of contigs Max Mean f l 250 1500 179 305 4 272 (2.420 Mbp) 267 (2.326 Mbp) 250 2000 200 305 3 271 (2.418 Mbp) 270 (2.351 Mbp) 250 3000 232 305 3 271 (2.419 Mbp) 267 (2.333 Mbp) 500 1500 403 459 10 336 (3.072 Mbp) 295 (2.625 Mbp) 500 2000 445 459 23 529 (4.701 Mbp) 371 (3.252 Mbp) 500 3000 509 459 49 529 (4.711 Mbp) 430 (3.793 Mbp) 1,000 1500 476 628 33 531 (4.745 Mbp) 427 (3.792 Mbp) 1,000 2000 533 629 29 529 (4.746 Mbp) 422 (4.746 Mbp) 1,000 3000 629 630 35 530 (4.742 Mbp) 412 (3.662 Mbp) 1,500 1500 537 705 11 424 (3.732 Mbp) 347 (3.028 Mbp) 1,500 2000 616 709 28 533 (4.778 Mbp) 424 (3.760 Mbp) 1,500 3000 748 711 22 535 (4.764 Mbp) 440 (3.887 Mbp) In this Table, the value of k was fixed to 6, and the value of D was fixed to 15,000. The contig with maximum length (Max) is reported in the number of fragments and the total genomic length in mega base pairs (Mbp). Similarly, the mean contig length (Mean) is also reported in the number of fragments and the total genomic length in mega base pairs Additional file  1: Table  S1. 
For the experiment in Fig. 5, t_ℓ was fixed at 2,000. In Table 1, we show the impact of varying t_f and t_ℓ together. For all experiments, contigs longer than 250 fragments are reported. The experiments show that for t_f = 250 the assembly quality is poor. This is expected, since the average sizing error exceeds 250. Similarly, for increasing values of D, we see a drop in quality. This is because larger values of D yield fewer bi-labels from an Rmap, which reduces the effective coverage of the data. Among the three k-mer sizes used, the best assembly quality is achieved with k = 6. This is set as our default k-mer value for all experiments.

Performance on E. coli

For the E. coli Rmap dataset, error correction took 2.66 hours of CPU time. The assembly results are summarized in Table 2. For this experiment we extracted bi-labels with k = 6 and D = 15000 and used the error tolerance parameter setting t_f = 500 and t_ℓ = 2000. rmapper took 342 seconds and a peak memory of 274 Mb to assemble the data. The assembler produced two unitigs longer than 500 fragments, of 529 and 522 fragments in length, both of which covered the reference from start to finish.

The Valouev assembler [24] took 204.8 hours to compute pairwise alignments between all pairs of Rmaps and an additional 30 minutes to assemble them into contigs. It produced 5 contigs, the longest of which was 102 fragments in length (corresponding to a 1 Mbp genomic span). We aligned the assembled contigs back to the reference and found the total genome coverage to be 48%. Bionano Solve produced a high-quality assembly, i.e., one contig that spanned 100% of the genome. The assembly took 48.14 hours of CPU time (59.75 minutes of wall time using 60 CPUs in parallel) and a peak memory of 1.18 GB. The Valouev aligner reported alignments for all contigs, hence we report zero mis-assembled contigs for all three methods.

In summary, the quality of the Bionano Solve and rmapper assemblies was comparable, yet rmapper was 480 times faster (6 minutes versus 2889 minutes) and used less than 500 Mb of memory.

Table 2 Assembly results for E. coli Rmap data simulated by OMSim using enzyme BspQI

Assembler   Run time   Peak memory (GB)   No. of contigs   Max             Mean            GF (%)   MA
Valouev     8.5 d      0.48               5                102 (1.0 Mbp)   56 (0.5 Mbp)    48       0
Solve       48.1 h     1.18               1                631 (4.9 Mbp)   631 (4.9 Mbp)   100      0
rmapper     6 m        0.46               2                529 (4.6 Mbp)   526 (4.5 Mbp)   100      0

The dataset has 23,450 Rmaps of mean size of 42 fragments and coverage of 900x. The peak memory is given in gigabytes (GB). The run time is reported in seconds (s), minutes (m), hours (h) and days (d). rmapper was run with k = 6, D = 15000 and error tolerance parameter setting t_f = 500 and t_ℓ = 2000. The contig with maximum length (Max) is reported in the number of fragments and the total genomic length in mega base pairs (Mbp). Similarly, the mean contig length (Mean) is also reported in the number of fragments and the total genomic length in mega base pairs. The genome fraction (GF) is the percentage of the genome that is covered by at least one contig. Lastly, the number of mis-assembled contigs (MA) is given

Table 3 Assembly results for human Rmap data simulated by OMSim using enzyme BspQI

Assembler     Run time   Peak memory (GB)   No. of contigs   Max                  Mean               GF (%)   MA
Valouev       > 360 d    n/a                n/a              n/a                  n/a                n/a      n/a
Solve         122.4 d    94.8               169              14,133 (124.6 Mbp)   2,036 (16.4 Mbp)   93.8     4
rmapper       12.1 h     7.9                3865             1,380 (14.4 Mbp)     144 (1.4 Mbp)      95.8     0
rmapper 2.0   22.2 h     18.8               3524             1,752 (18.5 Mbp)     203 (2.0 Mbp)      96.7     0

The dataset has 377,894 Rmaps of mean size of 61 fragments and coverage of 80x. See Table 2 for a description of the assembly statistics and notation.
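The speed-up factors of rmapper relative to Solve can be recomputed from the run times listed in Tables 2 and 3. The following sketch only converts the rounded table entries to seconds and takes ratios; the helper names are hypothetical, and small differences from the figures quoted in the text (for example 481 here versus the quoted 480 for E. coli) come from rounding in the tables.

```python
# Seconds per time unit, using the unit letters defined in the Table 2 caption.
SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value: float, unit: str) -> float:
    """Convert a (value, unit) run time such as (48.1, 'h') to seconds."""
    return value * SECONDS_PER_UNIT[unit]


def speedup(baseline: tuple, candidate: tuple) -> float:
    """How many times faster the candidate is than the baseline."""
    return to_seconds(*baseline) / to_seconds(*candidate)


# Run times copied from Table 2 (E. coli) and Table 3 (human).
SOLVE = {"E. coli": (48.1, "h"), "human": (122.4, "d")}
RMAPPER = {"E. coli": (6, "m"), "human": (12.1, "h")}
RMAPPER_2 = {"human": (22.2, "h")}

if __name__ == "__main__":
    for genome, run_time in RMAPPER.items():
        ratio = speedup(SOLVE[genome], run_time)
        print(f"rmapper vs Solve on {genome}: {ratio:.1f}x")
    ratio = speedup(SOLVE["human"], RMAPPER_2["human"])
    print(f"rmapper 2.0 vs Solve on human: {ratio:.1f}x")
```

On the human dataset the ratios come out at roughly 242.8 and 132.3, in line with the 242-fold and 132-fold speed-ups reported for rmapper and rmapper 2.0.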
In Table 3, rmapper 2.0 denotes the variant that extracts bi-labels from Rmaps in both forward and reverse directions, as described in the text.

Performance on human

For the human Rmap dataset, error correction took 1339.31 seconds of wall time running cOMet in parallel on 2000 CPUs (corresponding to 524 hours of CPU time). The assembly results are shown in Table 3. For this experiment we extracted bi-labels with k = 6 and D = 25000 and used the error tolerance parameter setting t_f = 1500 and t_ℓ = 2000. rmapper took 12.1 hours and a peak memory of 7.9 GB to assemble the data, whereas rmapper 2.0 took 22.2 hours and 18.8 GB of peak memory. rmapper produced 3134 contigs whereas rmapper 2.0 produced 2867 contigs. The longest unitigs produced by rmapper and rmapper 2.0 were 1380 and 1752 fragments in length, respectively. Lastly, rmapper achieved a net coverage of 95.8% while rmapper 2.0 was able to cover 96.7% of the genome, both with zero mis-assembled contigs.

The Valouev assembler did not produce any output after 360 CPU days, so n/a is reported in Table 3. Bionano Solve produced fewer but longer contigs than rmapper but had 4 mis-assembled contigs. In addition, it took approximately 2937 CPU hours (55 hours of wall time using 60 CPUs in parallel) and a peak memory of 94.8 GB. It is also worth noting that Bionano Solve performs an elaborate scaffolding and stitching of contigs, which explains the relatively small number of contigs but higher mis-assembly rate. The scaffolding and stitching cannot be decoupled from the assembly since Bionano only distributes a single executable that runs both. The source code is not publicly available.

In summary, the Valouev assembler did not scale to the human genome, rmapper 2.0 produced slightly longer contigs than rmapper, and Bionano Solve produced the longest contigs but covered 93.8% of the genome and had 4 mis-assembled contigs. In addition, rmapper 2.0 has the highest genome fraction, which is 96.7%. Lastly, rmapper and rmapper 2.0 were 242 and 132 times faster than Solve, respectively, and used 5 times less memory.

Performance on climbing perch

Error correction of the climbing perch (Anabas testudineus) Rmap dataset took 1.84 hours of wall time running cOMet in parallel on 3000 CPUs (corresponding to 2042 hours of CPU time). The assembly results are shown in Table 4. For this experiment we extracted bi-labels with k = 6 and D = 15000 and used the error tolerance parameter setting t_f = 1500 and t_ℓ = 2000. rmapper took 7.5 hours and a peak memory of 9.7 GB to assemble the data, whereas rmapper 2.0 took 14.9 hours and 18.77 GB of peak memory. rmapper produced 4573 contigs whereas rmapper 2.0 produced 4972 contigs. The longest unitigs produced by rmapper and rmapper 2.0 were 217 and 294 fragments in length, respectively. Lastly, rmapper achieved a genome fraction of 92.07%, while rmapper 2.0 was able to cover 95.05% of the genome. Both rmapper and rmapper 2.0 produced zero mis-assemblies.

The Valouev assembler did not halt on this dataset after 360 CPU days, so we do not report any results. Solve halted with a fatal error message in its final scaffolding step after 156 CPU days (93 hours of wall time using 60 CPUs in parallel), using a peak memory of 16 GB. We used the latest assembly result produced by the software in order to compare their assembly quality. Similar to the human assembly, Bionano Solve produced fewer and longer contigs than rmapper and had a genome coverage of 97.6%, but had 5 mis-assembled contigs.

In summary, the Valouev assembler did not scale to the human genome, rmapper 2.0 produced slightly longer contigs than rmapper, and Bionano Solve produced the longest contigs and covered 97.6% of the draft genome but had 5 mis-assembled contigs. rmapper 2.0 has a genome coverage of 95.05%, comparable to that of Solve, while running 251 times faster.

Table 4 Assembly results for the Rmap dataset of the climbing perch genome

Assembler     Run time   Peak memory (GB)   No. of contigs   Max              Mean             GF (%)   MA
Solve         156 d      16                 907              1032 (8.4 Mbp)   104 (7.9 Mbp)    97.6     5
rmapper       7.5 h      9.7                4573             217 (1.6 Mbp)    32 (0.28 Mbp)    92.07    0
rmapper 2.0   14.9 h     18.8               4972             294 (2.4 Mbp)    42 (0.4 Mbp)     95.05    0

The data was generated for the Vertebrate Genomes Project and consists of 3,121,480 Rmaps with a mean size of 28 fragments. The restriction enzyme used in the experiment is BspQI. See Table 2 for a description of the assembly statistics and notation. As described in the text, rmapper 2.0 extracts bi-labels from Rmaps in both forward and reverse directions. Bionano Solve halted with a fatal error message in its final scaffolding step. We used the latest assembly result produced by Solve in order to compare their assembly quality

Discussion and future work

We implement our approach and show its performance on multiple simulated and real datasets. Our experimental results show that the only non-proprietary method (i.e., that of Valouev et al. [24]) is unable to scale to the human and fish genomes, and that our method is at least 130 times faster than Bionano Solve and its memory usage is less than 20% of the memory usage of Bionano Solve. We point out that there is a trade-off between the length of the contigs, the genome fraction, and the number of mis-assemblies. Analogous to the assembly of short reads, ideally an assembler should return a small number of contigs or scaffolds which cover the entire genome and have no mis-assembled regions. In the case of the human and fish data, Solve was able to produce fewer and longer scaffolds than rmapper but produced more mis-assemblies than rmapper. Conversely, for the human data, rmapper produced contigs that covered a larger fraction of the genome with no mis-assembled regions. This highlights one trade-off in Rmap assembly. Hence, there is an opportunity to improve Rmap assemblers so that this gap between Solve and rmapper is closed. Another important note about the comparison between the assemblers is that rmapper has a very simple traversal algorithm and does not use any sort of scaffolding. This is because the main contribution of this work is formulating and solving the problem of assembly of Rmaps. Bionano Solve has a scaffolding algorithm that cannot be decoupled from the assembly step since only an executable is available. Thus, the results really compare rmapper's unitigs with Solve's scaffolds, and rmapper is still comparable.

This work presents the first non-proprietary Rmap assembler developed in the past decade, and thus opens the door for improving Rmap assembly. There are many related problems and possible improvements that warrant future research. First, the main contribution of our work was adapting the de Bruijn graph to Rmap data. For completeness, we perform depth-first search to traverse the bi-labelled de Bruijn graph and extract contigs. Our traversal does not attempt to reconcile complicated regions in the graph; however, we believe that there is a great opportunity to improve the length of the assembled optical maps by devising an algorithm to extend the traversal. Next, we hypothesize that by adapting methods designed for scaffolding and stitching optical mapping data [39, 40], the length of the assembled optical maps can be improved. Lastly, we note that there does not exist a method to evaluate optical map assemblies like there does for genome assemblies, QUAST [37] being the well-known genome assembly evaluation method. Furthermore, although some of the metrics of genome assembly evaluation tools (e.g., mean contig length and length of the longest contig) trivially extend to optical map assemblies, metrics that require sequence alignment to a reference genome (e.g., number of mis-assemblies) do not extend and need redevelopment.

Conclusion

Assembly of Rmap data is a fundamental problem in optical mapping that still remains in a nascent stage: prior to this work, there was only a single other non-proprietary assembler. In this paper, we formulate and describe the first de Bruijn graph approach for Rmap assembly by redefining the de Bruijn graph to adapt it to Rmap data. We accomplish this by extending the definition of a bi-label introduced in the context of the paired-end de Bruijn graph by Medvedev et al. [28]. We refer to our modified de Bruijn graph as the bi-labelled de Bruijn graph and demonstrate how to efficiently build and store it using a two-tiered orthogonal range search data structure.

We implement this approach, leading to a novel Rmap assembler that we call rmapper. We compare the performance of our method with the assembler of Valouev et al. and Bionano Solve on three genomes of varying size: E. coli, human, and climbing perch (a fish species from the Vertebrate Genomes Project). Our comparison demonstrates that rmapper was more than 130 times faster and used five times less memory than Solve, and was more than 2,000 times faster than the assembler of Valouev et al. Consequently, rmapper successfully assembled the 3.1 million Rmaps of the climbing perch genome into contigs that covered over 95% of the draft genome with zero mis-assemblies.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1186/s13015-021-00182-9.

Additional file 1: Table S1. Impact of varying the values of k, D and t_f on the assembly results for E. coli data.

Acknowledgements

Not applicable.

Authors' contributions

KM developed the software and carried out all experiments. All authors contributed towards the design of the algorithm and writing the manuscript. All authors read and approved the final manuscript.

Funding

This work was supported by NSF IIS (Grant No. 1618814) and Academy of Finland (Grants 308030, 335553, and 323233 to LS).

Availability of data and materials

Our software, rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper. Additional data can be accessed from the GitHub repository.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Author details

Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, USA.
Department of Computer Science, Helsinki Institute for Information Technology, HIIT, University of Helsinki, Helsinki, Finland.

Received: 19 January 2021 Accepted: 13 April 2021

References

1. Schwartz DC, Li X, Hernandez LI, Ramnarain SP, Huff EJ, Wang Y-K. Ordered restriction maps of saccharomyces cerevisiae chromosomes constructed by optical mapping. Science. 1993;262:110–4.
2. Li L, et al. OMSV enables accurate and comprehensive identification of large structural variations from nanochannel-based single-molecule optical maps. Genome Biol. 2017;18(1):230.
3. Fan X, Xu J, Nakhleh L. Detecting large indels using optical map data. In: RECOMB-CG. LNCS, vol. 11183, pp. 108–127. Springer, 2018.
4. Ganapathy G, et al. De novo high-coverage sequencing and annotated assemblies of the budgerigar genome. GigaScience. 2014;3:11.
5. Chamala S, et al. Assembly and validation of the genome of the non-model basal angiosperm amborella. Science. 2013;342(6165):1516–7.
6. Teague B, et al. High-resolution human genome structure by single-molecule analysis. Proc Natl Acad Sci USA. 2010;107(24):10848–53.
7. Muggli MD, Puglisi SJ, Ronen R, Boucher C. Misassembly detection using paired-end sequence reads and optical mapping data. Bioinformatics. 2015;31(12):80–8.
8. Pan W, Lonardi S. Accurate detection of chimeric contigs via BioNano optical maps. Bioinformatics. 2018;35(10):1760–2.
9. Reslewic S, et al. Whole-genome shotgun optical mapping of Rhodospirillum Rubrum. Appl Environ Microbiol. 2005;71(9):5511–22.
10. Zhou S, et al. A whole-genome shotgun optical map of Yersinia pestis strain KIM. Appl Environ Microbiol. 2002;68(12):6321–31.
11. Zhou S, et al. Shotgun optical mapping of the entire leishmania major Friedlin genome. Mol Biochem Parasitol. 2004;138(1):97–106.
12. Zhou S, et al. Validation of rice genome sequence by optical mapping. BMC Genom. 2007;8(1):278.
13. Zhou S, et al. A single molecule Scaffold for the Maize Genome. PLoS Genet. 2009;5:1000711.
14. Church DM, et al. Lineage-specific biology revealed by a finished genome assembly of the mouse. PLoS Biol. 2009;7(5):1000112.
15. Dong Y, et al. Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat Biotechnol. 2013;31:135.
16. Mukherjee K, Alipanahi B, Kahveci T, Salmela L, Boucher C. Aligning optical maps to de Bruijn graphs. Bioinformatics. 2019;35(18):3250–6.
17. Muggli MD, Puglisi SJ, Boucher C. Efficient indexed alignment of contigs to optical maps; 2014. pp. 68–81.
18. Muggli MD, Puglisi SJ, Boucher C. A Succinct Solution to Rmap Alignment. In: 18th International Workshop on Algorithms in Bioinformatics (WABI 2018), vol. 113; 2018. pp. 12–11216.
19. Muggli MD, Puglisi SJ, Boucher C. Kohdista: an efficient method to index and query possible rmap alignments. Algorithms Mol Biol. 2019;14:25.
20. Leung AK-Y, Kwok T-P, Wan R, Xiao M, Kwok P-Y, et al. Omblast: alignment tool for optical mapping using a seed-and-extend approach. Bioinformatics; 2016. 620.
21. Mendelowitz LM, Schwartz DC, Pop M. Maligner: a fast ordered restriction map aligner. Bioinformatics. 2016;32(7):1016–22.
22. Verzotto D, et al. Optima: Sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis. GigaScience. 2016;5(1):2.
23. Anantharaman TS, Mishra B, Schwartz DC. Genomics via optical mapping iii: Contiging genomic DNA and variations (extended abstract). New York: AAAI Press; 1997. p. 18–27.
24. Valouev A, Schwartz DC, Zhou S, Waterman MS. An algorithm for assembly of ordered restriction maps from single dna molecules. Proc Natl Acad Sci USA. 2006;103(43):15770–5.
25. Valouev A, et al. Alignment of optical maps. J Comp Biol. 2006;13(2):442–62.
26. Idury RM, Waterman MS. A new algorithm for DNA sequence assembly. J Comput Biol. 1995;2(2):291–306.
27. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001;98(17):9748–53.
28. Medvedev P, Pham S, Chaisson M, Tesler G, Pevzner P. Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. J Comput Biol. 2011;18:1.
29. Li M, et al. Towards a more accurate error model for BioNano optical maps. In: ISBRA 2016; 2016. pp. 67–79.
30. Chen P, Jing X, Ren J, Cao H, Hao P, Li X. Modelling BioNano optical data and simulation study of genome map assembly. Bioinformatics. 2018;34(23):3966–74.
31. Mukherjee K, Washimkar D, Muggli MD, Salmela L, Boucher C. Error correcting optical mapping data. GigaScience. 2018;7:1.
32. Bentley JL. Multidimensional binary search trees used for associative searching. Commun ACM. 1975;18(9):509–17.
33. Bankevich A, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77.
34. Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9.
35. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23.
36. Peng Y, Leung HC, Yiu S-M, Chin FY. IDBA-UD: A de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8.
37. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5.
38. Miclotte G, Plaisance S, Rombauts S, Van de Peer Y, Audenaert P, et al. OMSim: a simulator for optical map data. Bioinformatics. 2017;1:2740–2.
39. Pan W, Jiang T, Lonardi S. OMGS: optical map-based genome scaffolding. J Comput Biol. 2020;27(4):519–33.
40. Shelton JM, Coleman MC, Herndon N, Lu N, Lam ET, Anantharaman T, Sheth P, Brown SJ. Tools and pipelines for BioNano data: molecule assembly pipeline and fasta super scaffolding tool. BMC Genomics. 2015;16(1):734.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Journal

Algorithms for Molecular Biology, Springer Journals

Published: May 25, 2021
