Hind Alhakami, H. Mirebrahim, S. Lonardi (2017)
A comparative evaluation of genome assembly reconciliation tools. Genome Biology, 18
Haibao Tang, Xingtan Zhang, Chenyong Miao, Jisen Zhang, R. Ming, James Schnable, P. Schnable, E. Lyons, Jian-guo Lu (2015)
ALLMAPS: robust scaffold ordering based on multiple maps. Genome Biology, 16
NI Weisenfeld, V Kumar, P Shah, DM Church, DB Jaffe (2017)
Direct determination of diploid genome sequences. Genome Research, 27
M. Goel, Hequan Sun, Wen-Biao Jiao, Korbinian Schneeberger (2019)
SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biology, 20
G. Marçais, A. Delcher, A. Phillippy, Rachel Coston, S. Salzberg, A. Zimin (2018)
MUMmer4: A fast and versatile genome alignment system. PLoS Computational Biology, 14
Wen-Biao Jiao, G. Accinelli, Benjamin Hartwig, C. Kiefer, David Baker, E. Severing, Eva-Maria Willing, Mathieu Piednoel, Stefan Woetzel, Eva Madrid-Herrero, B. Huettel, Ulrike Hümann, R. Reinhard, M. Koch, Daniel Swan, Bernardo Clavijo, G. Coupland, Korbinian Schneeberger (2017)
Improving and correcting the contiguity of long-read genome assemblies of three plant species using optical mapping and chromosome conformation capture data. Genome Research, 27(5)
M. Grötschel, M. Jünger, G. Reinelt (1984)
A Cutting Plane Algorithm for the Linear Ordering Problem. Operations Research, 32
M. Goel, Hequan Sun, Wen-Biao Jiao, Korbinian Schneeberger (2019)
SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. bioRxiv
The Initiative (2000)
Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408
R. Dondi, F. Sikora (2020)
The Longest Run Subsequence Problem: Further Complexity Results. arXiv:2011.08119
Lauren Coombe, Vladimir Nikolić, Justin Chu, I. Birol, R. Warren (2020)
ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs. Bioinformatics, 36
Michael Alonge, Sebastian Soyk, Srividya Ramakrishnan, Xingang Wang, Sara Goodwin, F. Sedlazeck, Z. Lippman, M. Schatz (2019)
RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biology, 20
Johannes Köster, S. Rahmann (2018)
Snakemake - a scalable bioinformatics workflow engine. Bioinformatics, 34(20)
Joshua Burton, Andrew Adey, Rupali Patwardhan, R. Qiu, J. Kitzman, J. Shendure (2013)
Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature Biotechnology, 31
Abstract
Genome assembly is one of the most important problems in computational genomics. Here, we address an issue that arises in homology-based scaffolding, that is, when linking and ordering contigs to obtain larger pseudo-chromosomes by means of a second incomplete assembly of a related species. The idea is to use alignments of binned regions in one contig to find the most homologous contig in the other assembly. We show that ordering the contigs of the other assembly can be expressed by a new string problem, the longest run subsequence problem (LRS). We show that LRS is NP-hard and present reduction rules and two algorithmic approaches that, together, are able to solve large instances of LRS to provable optimality. All data used in the experiments as well as our source code are freely available. We demonstrate the usefulness of our approach within an existing larger scaffolding approach by solving realistic instances resulting from partial Arabidopsis thaliana assemblies in short computation time.

Keywords: Alignment, Assembly, String algorithm, Longest subsequence

Schrinner et al. Algorithms Mol Biol (2021) 16:11

*Correspondence: gunnar.klau@hhu.de
Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Full list of author information is available at the end of the article

© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Introduction
Genome assembly from sequencing reads enables the analysis of an organism at its genome level and is one of the most important problems in computational genomics. The first step is usually to assemble the reads based on overlap- or k-mer-based approaches to create contigs, which then need to be put into correct order and orientation in a scaffolding phase to generate the final assembly. The presence of a high-quality chromosome-level reference genome of the same species can significantly simplify assembly generation as it can be used as a template to order these contigs [1, 2]. However, for many species, such a reference genome is not available.

There are two commonly used approaches for scaffolding. First, different types of maps provide anchors for the contigs in the genome. These could be, for example, genetic maps, physical maps or cytological maps providing markers with known positions in the genome and known distances between each other [3]. The other approach is to use long-range genomic information to link multiple contigs and put them into correct order and orientation. Prominent examples are linked barcoded reads like 10X sequencing [4], Hi-C data based on chromatin conformation capture [5] and optical mapping [6].

Yet another way for contig scaffolding is to use two or more incomplete assemblies from closely related samples [7]. Regions of unconnected contigs for one sample might be connected with the help of another, related sample, e.g., a genome assembly of an individual of the same species, providing an overall gain in information for both samples. Local similarities between contigs from different samples can be used to align and order them. Ideally, this leads to long chromosome-like sequences called pseudo-chromosomes, where the contigs of different samples are aligned like shingles next to each other, as illustrated in Fig. 1(a). Note that this setting differs from the problem of assembly reconciliation [8], where the task is to build a consensus assembly from two or multiple input assemblies from the same species but which does not make use of homology information from different species.

Fig. 1 Homology-based scaffolding. a Independent initial assemblies (contigs), which are joined into pseudo-chromosomes by using homologies between contigs for scaffolding. b Alignments between contigs from different samples. A1 determines the order of B1, B2 and B3

Note that structural rearrangements such as translocations or inversions and repeat regions between genomes can result in non-sequential and non-unique mappings within contigs and can thus lead to misleading connections between contigs. These events need to be considered when finding homologous contigs as shown in Fig. 1(b).

In the simplest setting of two incomplete assemblies we are given two sets of contigs $A$ and $B = \{B_1, \dots, B_m\}$ computed from two different samples. As already stated, the contigs are not ordered with respect to genome positions, and it is this order that we want to compute. More precisely, we aim at inferring the most likely order from between-sample overlaps among the contigs.

Assuming we want to order the contigs in $B$, we first map the contigs of $A$ against the contigs of $B$, divide every contig $A_i$ of $A$ into smaller, equally sized chunks, called bins, and then determine the best matching contig in $B$ for every bin. If $A_i$ actually overlaps with multiple contigs in $B$, we should be able to partition $A_i$ into smaller parts based on mapping the bins to different contigs in $B$. However, sequencing or mapping errors as well as mutations between the samples can cause some bins to map onto a "wrong" contig, i.e., a contig belonging to a different area than the bin. Therefore a method to find the best partition of $A_i$ needs to distinguish between actual transitions from one B-contig to another and noise introduced by errors or mutations.

Figure 2 illustrates the different steps in solving this problem. Starting from a binned contig from $A$, here $A_1$ for illustration, and its mapping preferences to the unordered contigs in $B$, we reformulate this ordering problem as a string problem. In essence, we want to find the longest subsequence of the input string of mapping preferences that consists only of consecutive runs of contigs from $B$ where each such run may occur at most once. This subsequence corresponds to an ordering of the contigs in $B$, which can be transferred to the original problem.

Fig. 2 Processing of a single contig $A_1$. The bins are matched against all contigs of another sample B. Solving Longest Run Subsequence (LRS) on the corresponding string $S = b_1 b_1 b_4 b_1 b_1 b_1 b_3 b_3 b_3 b_1 b_3 b_2 b_2 b_2 b_3$ yields a maximal subsequence $S'$ with at most one run for every contig. This induces an optimal order for a subset of B-contigs

In this paper we formalize this process and introduce the Longest Run Subsequence problem (LRS). We show that LRS is NP-hard. Nevertheless, we want to solve large instances of LRS to provable optimality in reasonable running time and therefore present a number of reduction rules and two algorithms based on integer linear programming and dynamic programming, respectively. We evaluate both approaches on synthetic instances and find that they show complementary strengths regarding the number of runs and the alphabet size. We also test our approaches on realistic instances within the initial scaffolding phase of the SyRI package [7]. The test instances occurred during assembly of Arabidopsis thaliana samples and could not be solved by a brute-force method. We show that all those instances can be solved within short computation time. Our code and all data used in the experiments are freely available at https://github.com/AlBi-HHU/longest-run-subsequence. The software can also be installed with pip as a module from https://pypi.org/project/longestrunsubsequence/ or as "longestrunsubsequence" from bioconda and can thus easily be used within larger scaffolding packages.

Problem formulation
A string $S = s_1, \dots, s_m$ is a sequence over characters from a finite alphabet $\Sigma$. A subsequence of $S$ is a sequence $s_{i_1}, \dots, s_{i_k}$, such that $1 \le i_1 < i_2 < \dots < i_k \le m$. We denote the substring $s_i, \dots, s_j$ of $S$ as $S[i, j]$ and $k$ consecutive occurrences of a character $\sigma$ inside a string $S$ as $\sigma^k$ and call it a run. Let $\sigma(r)$ be the character of the run $r$ and $L(r)$ its length. By summarizing the characters of $S$ into maximally long runs, $S$ can be represented as a unique sequence of runs $r_1, \dots, r_n = \sigma(r_1)^{L(r_1)} \dots \sigma(r_n)^{L(r_n)}$. For every $\sigma \in \Sigma$ we define $P_\sigma(i)$ as the index of the last run before $r_i$ containing $\sigma$ in $S$ (0 if it does not occur). As an example, the string from Fig. 2 can be compressed to $b_1^2 b_4^1 b_1^3 b_3^3 b_1^1 b_3^1 b_2^3 b_3^1$ with $P_{b_1}(4) = 3$, $P_{b_1}(3) = 1$ and $P_{b_1}(1) = 0$.

We propose to model the optimal partition of a single contig as a string optimization problem. Formally, we use the contigs from set $B$ as the alphabet, that is $\Sigma = \{b_1, \dots, b_m\}$, and write the contig $A_i$ as a string $S = b_{i_1} \dots b_{i_m}$ over $\Sigma$ by replacing the bins of $A_i$ with the corresponding character of the best match from $B$. On the one hand, we want every single bin to be assigned to its preferred contig in $B$, but, on the other, we also want a simple partition, which is not skewed by wrong mappings of single bins. We therefore restrict valid partitions of $A_i$ to contain at most one contiguous part for every contig in $B$. This prevents large parts from being interrupted by single mismatching bins, at the cost of not being able to capture short-ranged translocations as seen in Fig. 1(b). A partition can be represented as a subsequence $S'$ of the string $S$, which contains at most one run for every $\sigma \in \Sigma$. The runs in $S'$ are the parts corresponding to one B-contig each, while the dropped characters from $S$ are bins in conflict with $S'$. Finding the best partition can thus be stated as the following optimization problem:

Problem 1 (Longest Run Subsequence, LRS) Given an alphabet $\Sigma = \{\sigma_1, \dots, \sigma_{|\Sigma|}\}$ and a string $S = s_1, \dots, s_m$ with $s_i \in \Sigma$, find a longest subsequence $S' = s'_1, \dots, s'_k$ of $S$, such that $S'$ contains at most one run for every $\sigma \in \Sigma$. That is, for every pair of positions $i$ and $j$ with $i < j$, it holds that

$$s'_i = s'_j \;\Rightarrow\; s'_l = s'_i \quad \text{for all } i < l < j.$$

We denote the length of an optimal LRS solution for $S$ with $LRS(S)$. Since we want to maximize the length of the run subsequence, it is always beneficial to either completely add or completely remove a run of $S$. Once a character $s_i \in \Sigma$ from a run in $S$ is added to $S'$, there can never be any other occurrence of $s_i$ outside this run. Thus, the entire run must be added to $S'$ to achieve maximum length. We will therefore mainly refer to runs instead of single characters.
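The run notation from this section can be made concrete with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `compress_runs`, `last_run_before` and `is_run_subsequence` are hypothetical, and contigs are encoded by their integer indices as an assumption for illustration.

```python
from itertools import groupby


def compress_runs(s):
    """Return the run representation [(character, length), ...] of a sequence."""
    return [(char, sum(1 for _ in group)) for char, group in groupby(s)]


def last_run_before(runs, char, i):
    """P_char(i): 1-based index of the last run before run i that uses `char`, or 0."""
    for j in range(i - 1, 0, -1):
        if runs[j - 1][0] == char:
            return j
    return 0


def is_run_subsequence(sub):
    """True if every character of `sub` occurs in at most one run."""
    chars = [char for char, _ in compress_runs(sub)]
    return len(chars) == len(set(chars))


# Mapping string from Fig. 2, written as indices of the B-contigs.
S = [1, 1, 4, 1, 1, 1, 3, 3, 3, 1, 3, 2, 2, 2, 3]
runs = compress_runs(S)
```

With this encoding, `runs` lists the eight runs of the running example, and `last_run_before(runs, 1, 4)` corresponds to $P_{b_1}(4)$.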
Complexity
In this section we prove hardness of the Longest Run Subsequence problem. More precisely, we show that dLRS, the decision version of the problem, is NP-complete. An instance of dLRS is given by a tuple $(S, k)$ and consists in answering the question whether $S$ has a run subsequence of length at least $k$.

Theorem 1 dLRS is NP-complete.

Proof It is easy to see that dLRS is in NP, because it can be checked in polynomial time whether a string $S'$ is a solution, that is, $S'$ is a run subsequence of $S$ and $|S'| \ge k$.

To prove NP-hardness, we reduce from the Linear Ordering Problem (LOP), which has been shown to be NP-hard [9]. LOP takes a complete directed graph with edge weights and no self-loops as input and looks for an ordering among the vertices, such that the total weight of edges following this order (i.e., edges leading from lower ordered vertices to higher ordered vertices) is maximized.

We show that dLOP, the decision problem of LOP, that is, the question whether a vertex ordering exists whose weight is at least a given threshold, can be polynomially reduced to dLRS. Let $G = (V, E)$ be a complete digraph with $|V| = n$. We denote the weight of $(v_i, v_j) \in E$ with $w_{ij}$ and the sum of all weights of $G$ as $w_{sum}$. Without loss of generality we can assume that all edge weights are positive: The number of edges following a linear order is fixed, so adding a sufficiently large offset to all weights only adds a fixed value to any solution without changing the core problem. This allows us to characterize LOP as finding an acyclic subgraph $G'$ with maximum weight, because the non-negativity of the weights always forces either $(v_i, v_j)$ or $(v_j, v_i)$ to be in $G'$ for every pair of vertices $v_i, v_j \in V$.

The proof consists of two parts. First, we show how to transform $G$ into a string $S$. Second, we show that $G$ has a LOP solution of weight at least $k$ if and only if $S$ has a run subsequence of size at least

$$f(k) := (n-1) \cdot M + \frac{n(n-1)(n-2)}{3} \cdot M' + n(n-1) \cdot w_{sum} + 2k \tag{1}$$

with $M' := 4n^2 \cdot w_{sum}$ and $M := M' \cdot n^3$.

For the transformation, we define $\Sigma$ using three different types of characters:

1 Separators $\$_i$ for every vertex $v_i \in V$.
2 Edge signs $E_{\{i,j\}}$ for every pair $v_i, v_j \in V$. Note that $E_{\{i,j\}} = E_{\{j,i\}}$.
3 Triangle signs $\Delta_{(i,j,k)}$ for every triangle in $G$. Note that triangles between three vertices have an orientation and can be rotated. Therefore $\Delta_{(i,j,k)} = \Delta_{(j,k,i)} = \Delta_{(k,i,j)} \neq \Delta_{(i,k,j)} = \Delta_{(k,j,i)} = \Delta_{(j,i,k)}$.

On the highest level the string $S$ is constructed as shown in Equation 2. It consists of one large block per vertex, each of them separated by a run of the associated separation sign of length $M$.

$$S = \underbrace{\overbrace{[EB]_{1,2}}^{\text{edge block for } (v_1, v_2)} [EB]_{1,3} \dots [EB]_{1,n}}_{\text{vertex block for } v_1} \; \$_1^{M} \; [EB]_{2,1} \dots [EB]_{2,n} \; \$_2^{M} \; \dots \; \$_{n-1}^{M} \; [EB]_{n,1} \dots [EB]_{n,n-1} \tag{2}$$

Each vertex block consists of a series of edge blocks (EB), which we define as follows:

$$[EB]_{i,j} = E_{\{i,j\}}^{\,w_{ij} + w_{sum}} \; \Delta_{(i,j,1)}^{M'} \dots \Delta_{(i,j,n)}^{M'} \; E_{\{i,j\}}^{\,w_{ij} + w_{sum}} \tag{3}$$

In the same way as the $i$-th vertex block is associated with vertex $v_i$, the edge substrings in it are associated with the outgoing edges of $v_i$. Note that there is one EB missing in every vertex block, as self-loops are not allowed. Finally, $[EB]_{i,j}$ contains all triangle signs for triangles in which $(v_i, v_j)$ occurs, i.e., $\{\Delta_{(i,j,k)} \mid 1 \le k \le n, k \neq i, k \neq j\}$, which, for the sake of notation, is written as $\Delta_{(i,j,1)}^{M'} \dots \Delta_{(i,j,n)}^{M'}$ in Eq. 3. The triangle signs are padded by edge signs for $(v_i, v_j)$. Every edge sign $E_{\{i,j\}}$ occurs only in the two edge blocks $[EB]_{i,j}$ and $[EB]_{j,i}$. The length of the edge sign runs depends on the weight of the corresponding edge (in either direction), rewarding the higher weighted edge. We also add $w_{sum}$ to the length of every edge sign run $E_{\{i,j\}}$.

As for the numbers $M$ and $M'$, the latter is chosen to be larger than the combined length of all edge sign runs. This makes a single triangle sign run more profitable than any selection of edge sign runs. In the same manner, $M$ is chosen to be larger than all triangle sign runs combined.

Using this construction, a valid solution $G' = (V, E')$ for a dLOP instance $(G, k)$, i.e., an acyclic subgraph of $G$ with total weight of at least $k$, can be transformed into a valid solution for a dLRS instance $(S_G, f(k))$. First, all separation runs are selected, yielding a total length of $(n-1) \cdot M$. Second, for every edge in $E'$, all edge signs in the corresponding edge blocks are selected. Since $|E'| = \frac{n(n-1)}{2}$ and every edge block contains two edge sign runs of length $w_{ij} + w_{sum}$, this adds at least $2 \cdot \frac{n(n-1)}{2} \cdot w_{sum} + 2k$ characters to the solution. Finally, $G'$ is acyclic, so for every triangle in $G$, there is at least one edge missing in $G'$. Thus, by construction of $S$, one run can be selected for every triangle sign without interfering with the edge sign runs, adding the missing $\frac{n(n-1)(n-2)}{3} \cdot M'$ characters.

Given a solution $S'$ for the dLRS instance $(S_G, f(k))$, we show how to obtain a subgraph $G'$ of total weight at least $k$ for the original dLOP instance. The subsequence $S'$ must contain all separation runs and a run for every triangle sign, because without all separation and triangle signs selected at some place, it is (by choice of $M$ and $M'$) impossible to reach length $f(k)$ for any $k$. Therefore every selected edge sign run belongs to a single edge block of a solution of dLRS. The idea is that the choice of selecting $E_{\{i,j\}}$ either in $[EB]_{i,j}$ or $[EB]_{j,i}$ corresponds to the choice of having either $(i, j)$ or $(j, i)$ in the DAG $G'$ for the original LOP. Since we added $w_{sum}$ to the length of every edge sign run and there are only $\frac{n(n-1)}{2}$ edge signs in total (with $n$ being the number of vertices in $G$), $S'$ must contain both runs inside an edge block, in order to reach length $n(n-1) \cdot w_{sum}$ (the third summand in $f(k)$). Thus, either edge signs or triangle signs may be selected inside an edge block, but not both. $G'$ is finally obtained by selecting an edge $e$ if and only if the edge sign runs in the corresponding edge block are selected. This yields $\frac{n(n-1)}{2}$ edges with a total weight of at least $k$. For every vertex pair $v_i, v_j$, exactly one of the edges $(v_i, v_j)$ and $(v_j, v_i)$ is selected, because their corresponding edge blocks share the same edge sign.

It remains to be shown that the obtained subgraph $G'$ is acyclic. We can directly conclude that $G'$ contains no triangles, since every triangle sign $\Delta_{(i,j,k)}$ has to be taken, prohibiting either $(i, j)$, $(j, k)$ or $(k, i)$ (or two of them) to be part of $G'$. Assume that $G'$ contains a cycle $v_{i_1}, v_{i_2}, v_{i_3}, \dots, v_{i_l}, v_{i_1}$ of length $l \ge 4$. Then, either $(v_{i_1}, v_{i_3})$ or $(v_{i_3}, v_{i_1})$ must be in $G'$. The latter would lead to a triangle, which we could already exclude from $G'$. But $(v_{i_1}, v_{i_3}) \in G'$ implies that a cycle of length $l - 1$ also exists in $G'$. Repeated use of this argument implies that $G'$ also has a cycle with length 3, which is a contradiction to triangles being excluded. Thus, $G'$ cannot contain a cycle of length 4 or greater and must be acyclic.

In summary, the decision problem whether there is a solution for a dLOP instance $(G, k)$ can be reduced to the decision problem whether a solution for the dLRS instance $(S_G, f(k))$ obtained from $G$ exists.

Solution strategies
To solve instances of LRS in practice we propose three reduction rules and two algorithmic approaches. Because of Theorem 1 we cannot guarantee a polynomial running time.

Reduction rules
In Sec. "Problem formulation" we already pointed out that an optimal solution for LRS always selects complete runs of characters and we reduced the notation of the input to runs of characters with a certain length each. This can also be seen as a reduction rule to the original problem formulation as the remaining size of the solution space now depends on the number of runs $n$ instead of the actual string length $m$. Two more reduction rules rely on the following lemma:

Lemma 1 Let $S, T$ be two strings over the disjoint alphabets $\Sigma_S$ and $\Sigma_T$. Then the optimal LRS solutions for $S$ and $T$ can be concatenated to form an optimal solution for the concatenated string $ST$.

Proof Since the two alphabets are disjoint, an LRS solution for $S$ does not contain any characters from $\Sigma_T$. Therefore the choice of the subsequence for $S$ does not influence the valid subsequences for $T$ and vice versa. This means that optimal solutions for $S$ and $T$ can be computed independently and concatenated to form a valid solution for $ST$. Obviously, an optimal solution for $ST$ cannot be longer than the combined length of optimal solutions for $S$ and $T$, otherwise the latter would not be optimal.

According to Lemma 1 we can divide an LRS instance $S$ into smaller independent instances, if we find a prefix $r_1, \dots, r_p$ of $S$, which uses an exclusive sub-alphabet $\Sigma'$, i.e., $r_1, \dots, r_p \in \Sigma'$ and $r_{p+1}, \dots, r_n \in \Sigma \setminus \Sigma'$. This prefix rule can be applied in linear time by starting with the prefix $r_1$ and extending it until we either reach the end of $S$, in which case no independent suffix exists, or until the prefix is closed regarding the used characters. Let $p$ be the index of the last occurrence of $\sigma(r_1)$. Since $\sigma(r_1)$ is used in the prefix, all runs $r_2, \dots, r_p$ must belong to the prefix as well. Now start with $l = 2$ and update $p$ to the index of the last occurrence of $\sigma(r_l)$ (if this index is higher than $p$), increase $l$ by 1 and repeat until $l > p$. If $p < n$, an independent prefix is found, otherwise such a prefix does not exist.
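The prefix search described above can be sketched as follows. This is an illustrative reading of the rule, not the authors' code: `independent_prefix` and `split_into_independent_parts` are hypothetical names, and the input is assumed to be the list of run characters $\sigma(r_1), \dots, \sigma(r_n)$.

```python
def independent_prefix(chars):
    """Return p (1-based) so that runs 1..p use an exclusive sub-alphabet, else None."""
    n = len(chars)
    if n == 0:
        return None
    # Index of the last occurrence of every character (1-based).
    last = {c: idx for idx, c in enumerate(chars, start=1)}
    p = last[chars[0]]
    l = 2
    while l <= p:
        # Every character inside the prefix pushes the border to its last occurrence.
        p = max(p, last[chars[l - 1]])
        l += 1
    return p if p < n else None


def split_into_independent_parts(chars):
    """Apply the prefix rule repeatedly and return the resulting independent parts."""
    parts, start = [], 0
    while start < len(chars):
        p = independent_prefix(chars[start:])
        if p is None:
            parts.append(chars[start:])
            break
        parts.append(chars[start:start + p])
        start += p
    return parts
```

By Lemma 1, each returned part can be solved on its own and the solutions concatenated. For the running example from Fig. 2 no independent prefix exists, because $b_1$ and $b_3$ interleave up to the last run.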
This idea can be extended to the infix rule, which finds independent infixes via the following lemma.

Lemma 2 Let $S, T$ be two strings over the disjoint alphabets $\Sigma_S$ and $\Sigma_T$ and let $l$ be an arbitrary position in $S$. Then it holds that

$$LRS(s_1 \dots s_l \, T \, s_{l+1} \dots s_m) = LRS(s_1 \dots s_l \, \$^{LRS(T)} \, s_{l+1} \dots s_m)$$

with $\$ \notin \Sigma_S \cup \Sigma_T$.

Proof For the same reason as in Lemma 1 the instance $T$ can be solved independently from $S$. For the combined string $s_1 \dots s_l \, T \, s_{l+1} \dots s_m$ the infix $T$ is either entirely dropped in the optimal subsequence or the optimal solution of $T$ itself is entirely taken as a part of the combined solution. Thus, $T$ contributes either 0 or $LRS(T)$ characters to the optimal combined solution. Therefore, if the solution for $T$ is already known, $s_1 \dots s_l \, T \, s_{l+1} \dots s_m$ can be solved by replacing $T$ with a run of length $LRS(T)$ of a new character $\$$.

Following Lemma 2 we can search for an independent infix in $S$ to obtain two smaller instances. Instead of starting with $r_1$, we start with an arbitrary character $\sigma \in \Sigma$ as anchor and use the infix $r_p, \dots, r_q$ as a start with $r_p$ and $r_q$ being the first and last occurrence of $\sigma$, respectively. Similarly to the prefix search, we iterate over all runs in the infix and move the markers $p, q$ to the left and right, whenever we encounter a new character with occurrences outside $r_p, \dots, r_q$, until the infix is closed (with respect to used characters) or the entire string is contained. This is repeated with every character in $\Sigma$ as anchor, possibly yielding multiple infixes. Adjacent independent infixes are merged into larger ones, since we want as many runs as possible to be replaced with a single run. Infixes which consist of only one run are discarded, because they do not pose an actual reduction. Finding and merging all infixes can be done in time $O(n \cdot |\Sigma|)$.

For a maximum reduction, the rules are applied as follows: First, the prefix rule is iteratively applied on $S$ until no further independent prefix can be found. Second, the infix rule is applied on every sub-instance found so far. For every infix found the procedure is repeated by starting with the prefix rule again.

Solving with integer linear programming
We present two algorithms to solve LRS to optimality, which have complementary strengths and weaknesses. The first is based on an Integer Linear Program (ILP). This approach scales well with large alphabets, but struggles with a large number of runs. We also propose a dynamic programming (DP) approach, which remains fast for long strings, but suffers from large alphabets. Both algorithms work exclusively on the runs of an input string $S$.

ILPs are a commonly used technique to model and solve combinatorial optimization problems. We model the LRS formulation from before as an ILP in the following way: Let $n$ be the number of runs in $S$ and let $x_1, \dots, x_n$ be binary variables with $x_i = 1$ if $r_i$ is in the optimal subsequence and $x_i = 0$ otherwise. Any possible subsequence of runs can therefore be represented by a variable assignment. Since we want to maximize the length of the subsequence, we define our objective function as the weighted sum over all taken runs, using their lengths as weights. Let $r_i, r_j$ be two runs with $i < j$ and $\sigma(r_i) = \sigma(r_j)$. If both runs are selected, all intermediate runs $r_l$ with a different character must be excluded. This yields the following ILP:

$$\max \sum_{i=1}^{n} x_i \, L(r_i) \tag{4}$$

subject to

$$x_l \le 2 - x_i - x_j \quad \forall\, i < l < j,\; \sigma(r_i) = \sigma(r_j) \neq \sigma(r_l) \tag{5}$$

$$x_i \in \{0, 1\} \quad \forall\, 1 \le i \le n \tag{6}$$

During the implementation it turned out that a single, more complex constraint for each pair $r_i, r_j$ with equal characters was solved slightly faster by the used ILP solver. Thus, we actually use the following equivalent set of constraints instead of (5):

$$\sum_{\substack{i < l < j \\ \sigma(r_l) \neq \sigma(r_i)}} x_l \le (j - i) \cdot (2 - x_i - x_j) \quad \forall\, i < j,\; \sigma(r_i) = \sigma(r_j) \tag{7}$$

If either $r_i$ or $r_j$ is not taken, the respective constraint does not prevent any other combination of runs between them. The total number of constraints is bounded by $\lceil \frac{n}{2} \rceil^2$ and the number of non-zero entries in the constraint matrix is bounded by $n \cdot \lceil \frac{n}{2} \rceil^2$.
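To illustrate what the model in (4), (6) and (7) expresses, the following sketch enumerates the aggregated constraints for a list of runs and checks them by brute force. It deliberately does not call an ILP solver: exhaustive enumeration over all $2^n$ assignments is swapped in here only so that the example stays self-contained, and it is feasible only for tiny instances. All function names are hypothetical.

```python
from itertools import product


def aggregated_constraints(chars):
    """Yield (i, j, between) for all pairs i < j of runs with equal characters.

    `between` holds the 0-based indices of intermediate runs with a different
    character, i.e. the variables summed on the left-hand side of constraint (7).
    """
    n = len(chars)
    for i in range(n):
        for j in range(i + 1, n):
            if chars[i] == chars[j]:
                between = [l for l in range(i + 1, j) if chars[l] != chars[i]]
                if between:
                    yield i, j, between


def is_feasible(x, chars):
    """Check a 0/1 assignment against every constraint of type (7)."""
    return all(
        sum(x[l] for l in between) <= (j - i) * (2 - x[i] - x[j])
        for i, j, between in aggregated_constraints(chars)
    )


def solve_by_enumeration(runs):
    """Maximize objective (4) over all feasible assignments (tiny inputs only)."""
    chars = [c for c, _ in runs]
    lengths = [length for _, length in runs]

    def objective(x):
        return sum(xi * li for xi, li in zip(x, lengths))

    feasible = (x for x in product((0, 1), repeat=len(runs)) if is_feasible(x, chars))
    best = max(feasible, key=objective)
    return objective(best), best


# Runs of the running example from Fig. 2 as (contig index, run length).
example_runs = [(1, 2), (4, 1), (1, 3), (3, 3), (1, 1), (3, 1), (2, 3), (3, 1)]
value, selection = solve_by_enumeration(example_runs)
```

In a real setting the same constraints would be handed to an ILP solver; the factor $(j - i)$ acts as a sufficiently large constant because fewer than $j - i$ runs lie strictly between $r_i$ and $r_j$.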
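Solving with dynamic programming
As an alternative to the ILP formulation the problem can also be solved bottom-up by a dynamic program (DP). Let $D[i, F]$ be the length of an optimal LRS solution for $r_1 \dots r_i$, which includes $r_i$ itself and only contains characters from $F \subseteq \Sigma$. The DP can be initialized with $D[0, \emptyset] = 0$ and $D[0, F] = -\infty$ for $F \neq \emptyset$. Known solutions can be extended run by run, always selecting an optimal predecessor for each run and keeping track of already used characters with the second parameter $F$.

For $\$ \notin \Sigma$, let $R_S(i) = \{ P_\sigma(i) \mid \sigma \in \Sigma \cup \{\$\},\; P_\sigma(i) \ge P_{\sigma(r_i)}(i) \}$ contain the positions of the last occurrences for every $\sigma \in \Sigma$ between position $i$ and the last occurrence of $\sigma(r_i)$ before $i$ (or 0 if $i$ is the first occurrence of its kind). Here the character $\$$ stands for an artificial start position 0 (see Fig. 3). If $r_j, r_i$ are two consecutive runs of an optimal solution, there can be no other runs between $j$ and $i$ using the same character, as this would make the solution sub-optimal. Thus, if an optimal solution contains a run $r_i$, it either is the first selected run or the preceding run must be from a position $j \in R_S(i)$. This restricts the number of possible predecessors for each run in the DP to $O(|\Sigma|)$. The full DP is then as follows:

$$D[0, \emptyset] = 0, \qquad D[0, F] = -\infty \quad \forall F \neq \emptyset \tag{8}$$

$$D[i, F] = \max_{j \in R_S(i)} \begin{cases} D[j, F] + L(r_i) & \text{if } \sigma(r_j) = \sigma(r_i) \\ D[j, F \setminus \{\sigma(r_i)\}] + L(r_i) & \text{if } \sigma(r_j) \neq \sigma(r_i) \end{cases} \tag{9}$$

The recursion can be visualized by a directed acyclic graph as shown in Fig. 3. It contains a start vertex corresponding to the empty prefix of $S$ and one vertex for every run in $S$. Every path in the graph corresponds to a (possibly invalid for LRS) subsequence of $S$. Each vertex $i$ has an incoming edge from each position $j \in R_S(i)$.

Fig. 3 Graph visualizing the recursion for the running example $\$^1 b_1^2 b_4^1 b_1^3 b_3^3 b_1^1 b_3^1 b_2^3 b_3^1$ (positions 0 to 8). Arcs represent the possible predecessors for every run. Colors mark an optimal path and the DP entries taken by the recursion: $D[0, \{\}]$, $D[3, \{b_1\}]$, $D[4, \{b_1, b_3\}]$, $D[6, \{b_1, b_3\}]$ and $D[7, \{b_1, b_2, b_3\}]$

$D[i, F]$ is computed by taking all possible predecessor positions $j$ and extending the solutions by $r_i$. If $\sigma(r_i) = \sigma(r_j)$, the solution is extended by $r_i$ without introducing a new character. The length for the new solution would be the optimal length for position $j$, using the same sub-alphabet $F$ and adding the length of $r_i$. For $\sigma(r_i) \neq \sigma(r_j)$ the used sub-alphabet must also be extended by $\sigma(r_i)$, requiring to look up the previous solution from $D[j, F \setminus \{\sigma(r_i)\}]$ instead of $D[j, F]$.

An optimal solution for LRS can be found by taking the entry of $D$ with the highest length and using the backtracking information from the DP to obtain the corresponding subsequence. The DP table has a total of $n + 1$ columns and $2^{|\Sigma|}$ rows with each entry taking $O(|\Sigma|)$ time to compute. This leads to a worst-case runtime of $O(|\Sigma| \cdot n \cdot 2^{|\Sigma|})$ for the DP, making this a fixed parameter tractable (FPT) approach for LRS with the alphabet size as parameter.

Experiments
We performed computational experiments on two different types of instances. First, we generated random instances to see how the two algorithms scale on string length and alphabet size. Second, we integrated the algorithms into the software SyRI [7], which finds structural rearrangements between two assemblies of related species and has an additional stage for homology-based scaffolding, where the algorithms are used.

The ILP has been implemented using the Python interface of PuLP (https://github.com/coin-or/pulp), which solves the ILP with the free solver CoinOR. All tests were run on an AMD Epyc 7742 processor with 1TB of memory running on Debian. The algorithms are implemented in Python and executed via Snakemake [10] using Python 3.9.1 and PuLP version 2.3.1.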
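The recursion in Eqs. 8 and 9 from the section "Solving with dynamic programming" can be sketched in plain Python as follows. This is one possible reading, not the authors' implementation: `lrs_length` is a hypothetical name, character sets $F$ are stored as bitmasks, only reachable entries are kept (so $-\infty$ entries are simply absent), and backtracking is omitted for brevity.

```python
def lrs_length(runs):
    """Length of a longest run subsequence for runs given as (character, length)."""
    chars = [c for c, _ in runs]
    # One bit per character, in order of first appearance.
    bit = {c: 1 << k for k, c in enumerate(dict.fromkeys(chars))}
    table = [{0: 0}]  # table[0]: virtual start position with no character used
    last = {}         # last[c]: 1-based index of the latest run with character c
    best = 0
    for i, (c, length) in enumerate(runs, start=1):
        lower = last.get(c, 0)  # P_{sigma(r_i)}(i)
        # R_S(i): last occurrences of all characters that are not before `lower`.
        preds = {p for p in last.values() if p >= lower}
        if lower == 0:
            preds.add(0)  # r_i may be the first selected run
        row = {}
        for j in preds:
            same_char = j > 0 and chars[j - 1] == c
            for used, value in table[j].items():
                if same_char:
                    new_used = used            # extend the existing run of c
                elif used & bit[c]:
                    continue                   # c already formed an earlier run
                else:
                    new_used = used | bit[c]   # open a new run for c
                if value + length > row.get(new_used, -1):
                    row[new_used] = value + length
        table.append(row)
        best = max(best, max(row.values(), default=0))
        last[c] = i
    return best


# Runs of the running example from Fig. 2 as (contig index, run length).
example_runs = [(1, 2), (4, 1), (1, 3), (3, 3), (1, 1), (3, 1), (2, 3), (3, 1)]
```

For the running example the optimum is reached at run 7 with the character set $\{b_1, b_2, b_3\}$, matching the entry $D[7, \{b_1, b_2, b_3\}]$ highlighted in Fig. 3. Storing a predecessor pointer next to each value would allow the subsequence itself to be recovered.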
The length for the new The synthetic data was created by randomly generat - solution would be the optimal length for positon j, ing strings with different lengths and alphabet sizes. For using the same sub-alphabet F and adding the length of any combination a total of 20 strings was generated, such r . For σ(r ) = σ(r ) the used sub-alphabet must also be i i j that every string is guaranteed to use the entire alphabet extended by σ(r ) , requiring to look up the previous solu- i assigned to it. These instances pose worst-case instances tion from D[i, F \{σ(r )}] instead of D[i, F]. i for our algorithms, as the proposed reduction rules can An optimal solution for LRS can be found by tak- ing the entry of D with the highest length and using the https:// github. com/ coin- or/ pulp. Schrinner et al. Algorithms Mol Biol (2021) 16:11 Page 8 of 11 String lengthscaling (running time) DP(6) DP(10) DP(16) DP(20) −1 ILP(6) ILP(10) −2 ILP(16) ILP(20) −3 −4 20 30 40 50 60 70 80 90 100 String length Alphabetsize scaling (runningtime) DP(20) DP(40) DP(60) DP(80) −1 ILP(20) ILP(40) −2 ILP(60) ILP(80) −3 −4 68 10 12 14 16 18 20 22 24 Alphabet size Fig. 4 Running time plotted against string length (top) and alphabet size (bottom). Each curve represents an algorithm and an additional parameter (number in parentheses), which is alphabet size in the top plot and the string length in the bottom plot hardly be applied. The runs are quite short in general linearly with the number of runs and exponentially with and since there is no structurally induced locality among alphabet size. This is reflected both in running time and the characters, instances could be split very rarely. All memory consumption shown in Fig. 5. Especially the instances were solved with all reductions rules applied. 
latter is problematic, as alphabet sizes of 24 or higher Figure 4 shows how the runtime scales with both might require more memory than a usual desktop com- increasing string lengths and increasing alphabet size. puter offers. The ILP consumes more memory than the For a fixed alphabet size the runtime scales about expo - DP on small alphabets, but shows no increased mem- nentially with the string length for the ILP as shown in ory footprint as the alphabet size grows. The decreas - the top plot. In fact, the alphabet size only has very minor ing running time for very large alphabets is caused by effect on the ILP compared to the string length, which the reduction rules, as it leads to a higher number of becomes visible in the bottom plot, with a slight favor of characters occurring only in a single run and thus to a larger alphabets. The DP behaves complementary to the higher chance of the string being splittable into inde- ILP, scaling exponentially in the alphabet size and sub- pendent parts. exponentially with string length. Based on this empirical data, the final version of our The scaling can be explained by the properties of the tool uses both algorithms depending on string length algorithms. The ILP has a binary decision variable for and alphabet size. If |s| < 10(|�|− 13) the ILP is pre- every run, increasing the number of possible (but not ferred, otherwise it is the DP. necessarily feasible) variable assignments exponentially with the number of runs. Once the ILP solver has to fall Biological data back to branch-and-bound, the scaling becomes expo- The LRS model is being used to generate homology- nential. Larger alphabets might lead to a lower number based pseudo-chromosome level assemblies in the of constraints (and thus a lower runtime), as the ILP chroder method of SyRI [7], i.e., the process of creating contains one constraint for every pair of runs with the homology-based chromosome-level assemblies in case same character. 
As already pointed out in Sect. "Solv- only scaffold-level assemblies are available. We consider ing with dynamic programming" the DP table grows avg. runtime / instance (s) avg. runtime/ instance (s) S chrinner et al. Algorithms Mol Biol (2021) 16:11 Page 9 of 11 Alphabet size scaling (memory consumption) DP(20) DP(40) DP(60) DP(80) ILP(20) ILP(40) 10 ILP(60) ILP(80) 68 10 12 14 16 18 20 22 24 Alphabetsize Fig. 5 Memory consumption plotted against alphabet size. Each curve represents a combination of an algorithm a string length, printed in parentheses a dataset which was generated in [7] to test the perfor- Originally the ordering problem in chroder was mance of an approach to find structural rearrangements. solved using a brute-force method, which was unable It consists of 100 fragmented assemblies of varying con- to solve 16 out of 100 instances within a reasonable tiguity that have been generated by introducing 10 to 400 amount of time and memory. We tested these 16 LRS random breaks in chromosome-level assemblies of the instances separately and ran them using the DP and ILP Col-0 and Ler accessions of Arabidopsis thaliana [7, 11]. algorithms presented in this paper. We used the LRS-based chroder method to scaffold these Using all three reduction rules, both algorithms were assemblies in order to estimate the usefulness of the able to solve all instances in very short computation model in generating homology-based pseudo-chromo- time, thus demonstrating the practical efficiency of our somes. The LRS instances were created by mapping both algorithms, see Table 1. For this reason, the previous sets of contigs against each other using nucmer [12] and dividing the contigs into equally long bins afterwards. For each bin the best matching contig of the other contig set is determined based on the previously computed local (a) (b) mappings of each contig. 
Considering that each bin can only be assigned to one contig and represents one char- acter in the constructed LRS instance, shorter bins pre- serve more information, but also increase the complexity of the LRS instance. The bin size was empirically chosen as 10kb producing reasonable results while maintaining solvable instances. The scripts and used assemblies are made publicly available. For more than 85% of the simulated genomes (both Col-0 and Ler) the pseudo-chromosome N50 values resulting from solving the LRS problem within chroder were five times higher than those of the corresponding fragmented assemblies, with the N50 being even more than ten times higher for more than 30% of the sam- ples (see Figure 6a). For many of the highly fragmented assemblies it is difficult to order fragments because of the presence of repetitive regions in both genomes. Note that, even if the LRS-based method is not able Fig. 6 Performance improvement of using the homology-based to generate the original full length genomes in these scaffolding of chroder. a Increase of N50 values between raw cases, it significantly decreases the number of disjoint (scaffold-level) assemblies and the output of chroder using LRS. fragments, increasing assembly contiguity as shown in Each line represents one of 100 generated fragmented assemblies. b Figure 6b. Decrease in contig count between both assemblies max. memory consumption (MB) Schrinner et al. Algorithms Mol Biol (2021) 16:11 Page 10 of 11 Table 1 Comparison of runtime (in seconds) between DP and ambiguity is ignored, which might drop valuable infor- ILP on instances from real data. The times are for all 16 instances mation. There is also no support for inversions inside that proved difficult for the previous brute-force method. The the model. 
While inverted alignments can be taken into columns correspond to different reduction rules used account for the mapping step of a single bin, the model stays unaware of inversions and the fact that an interval Algorithm All rules No infix rule No prefix and infix rule of bins is actually in the reverse order compared to the DP 0.006 0.003 Out of memory second assembly. However, this might not be as prob- ILP 0.006 0.006 0.56 lematic as it sounds, because the bins are not mapped to other bins but to entire contigs. As long as inversions are contained in a single contig, they should have no impact brute-force method has been replaced by the implemen- on the ordering that the model produces. tation of our algorithms within the chroder script of the SyRI package. In fact, the instances were almost com- pletely solved by the prefix rule alone, resulting in many Conclusion trivial sub-instances (singleton runs). Note that for these Ordering contigs by means of an incomplete assembly of singleton runs we did not call the ILP solver as the over- a related species occurs as a variant of homology-assisted head of setting up the ILP would have dominated the assembly, which does not require chromosome-level running time. However, not using the prefix or infix rule assemblies already. We introduced the Longest Run Sub- reveals differences between both algorithms. The alpha - sequence (LRS) problem, formalizing the contig ordering bet sizes between 31 and 38 caused the DP to run out of problem as a string problem. We proved that LRS is NP- memory, while the ILP remained fast with the longest hard and presented reduction rules and two algorithms, instance consisting of only 50 runs. which work well for long instances and large alphabets, respectively, which we showed on a synthetic data set. 
Discussion Regarding real data, we managed to solve all instances Note that the purpose of this paper is not to present a that could not be solved by a brute force approach in full novel scaffolding method but rather to introduce an short computation time. In fact, the original brute-force- algorithm that may prove useful in existing methods for based method in the popular SyRI tool has been replaced scaffolding. We demonstrated its usefulness within the by the open-source implementation of our algorithms. first phase of the SyRI tool that needs chromosome-level From the theoretical side, we find it interesting to fur - assemblies as input. ther investigate approximability and fixed-parameter The experiments showed that optimal LRS solutions tractability of LRS. Some of these suggestions have been can be found in short time for instance sizes that occur picked up in a recent preprint [13]. From a practical per- on assemblies of real samples. We presented two dif- spective, we plan to further test the approach on real ferent algorithms whose running times depend on two assembly data, also taking more than two related assem- important instance properties, namely string length and blies into account. alphabet size. Random strings, however, do not seem to resemble actual assembly instances, which are already Acknowledgements We thank Max Jakub Ried for setting up the software package on PyPI. pre-sorted except for some noise or rearrangements. The reduction rules have little to no impact on random Authors’ contributions strings, while they reduce the assembly instances to All authors contributed to the algorithms and to the design of the study. SSch developed the idea to prove NP-hardness. MW developed the idea for the almost trivial sub-instances. This implies that reduction DP-based approach. SSch and MW implemented the algorithms. SSch ran the rules might be more important in practice than the algo- tests for simulated data. SSch, MG and GWK wrote the paper. 
All authors read rithm to process the remaining preprocessed instance. and approved the final manuscript. One potential problem of the model itself was men- Funding tioned in Sect. "Problem formulation". LRS only allows Open Access funding enabled and organized by Projekt DEAL. Funded by for one run per character, which automatically induces the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2048/1 – projectID 390686111 as an ordering on the underlying contigs. This can be prob - well as under projectID 395192176. lematic if the binned contig contains a translocation that splits a long run into two, e.g., b b b b b b b b b . The 1 1 1 2 2 2 1 1 1 Availability of data and materials The source code and snakemake pipeline to create and run the simulated LRS model will drop one of the b runs, even though it data is available at https:// github. com/ AlBi- HHU/ longe st- run- subse quence. would be better to leave the order of B and B open due 1 2 The software itself can be installed from https:// pypi. org/ proje ct/ longe strun to lack of evidence. subse quence/. A collection of all used code and data, including the experi- ments which two assemblies from Arabidopsis thaliana have been uploaded Another limitation arises while mapping the bins. Since to Zenodo at https:// doi. org/ 10. 5281/ zenodo. 45522 11. only the best match for every bin is taken, any mapping S chrinner et al. Algorithms Mol Biol (2021) 16:11 Page 11 of 11 chromatin interactions. Nat Biotechnol. 2013; 31(12), 1119–1125. https:// Declarations doi. org/ 10. 1038/ nbt. 2727. 6. Jiao W-B, Accinelli GG, Hartwig B, Kiefer C, Baker D, Severing E, Willing E-M, Competing interests Piednoel M, Woetzel S, Madrid-Herrero E, Huettel B, Hümann U, Reinhard The authors declare that they have no competing interests. R, Koch MA, Swan D, Clavijo B, Coupland G, Schneeberger K. 
Improving and correcting the contiguity of long-read genome assemblies of three Author details plant species using optical mapping and chromosome conformation Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Düsseldorf, 2 3 capture data. Genome Res. 2017; 27(5), 778–786. https:// doi. org/ 10. 1101/ Germany. Heinrich Heine University Düsseldorf, Düsseldorf, Germany. Cluster gr. 213652. 116. of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düssel- 7. Goel M, Sun H, Jiao W-B, Schneeberger K. SyRI: finding genomic rear - dorf, Düsseldorf, Germany. Max Planck Institute for Plant Breeding Research, rangements and local sequence differences from whole-genome Cologne, Germany. Faculty of Biology, LMU Munich, Großhaderner Str. 2, assemblies. Genome Biol. 2019; 20(1), 277. https:// doi. org/ 10. 1186/ 82152 Planegg-Martinsried, Germany. s13059- 019- 1911-0. 8. Alhakami H, Mirebrahim H, Lonardi S. A comparative evaluation of Received: 23 February 2021 Accepted: 5 June 2021 genome assembly reconciliation tools. Genome Biol. 2017; 18(1), 93. https:// doi. org/ 10. 1186/ s13059- 017- 1213-3. 9. Grötschel M, Jünger M, Reinelt G. A cutting plane algorithm for the linear ordering problem. Operations Res. 1984; 32, 1195–1220. https:// doi. org/ 10. 1287/ opre. 32.6. 1195. References 10. Köster J, Rahmann S. Snakemake–a scalable bioinformatics workflow 1. Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, Sedlazeck FJ, engine. Bioinformatics. 2012; 28(19), 2520–2522. https:// doi. org/ 10. 1093/ Lippman ZB, Schatz MC. RaGOO: fast and accurate reference-guided scaf- bioin forma tics/ bts480. folding of draft genomes. Genome Biol. 2019;20(1):224. https:// doi. org/ 11. The Arabidopsis Genome Initiative. Analysis of the genome sequence of 10. 1186/ s13059- 019- 1829-6. the flowering plant Arabidopsis thaliana. Nature. 2000;408(6814):796–815. 2. Coombe L, Nikolić V, Chu J, Birol I, Warren RL. ntJoin: Fast and lightweight 12. 
Marcais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. Mum- assembly-guided scaffolding using minimizer graphs. Bioinformatics. mer4: a fast and versatile genome alignment system. PLOS Comput Biol. 2020. https:// doi. org/ 10. 1093/ bioin forma tics/ btaa2 53. 2018; 14(1), 1–14. https:// doi. org/ 10. 1371/ journ al. pcbi. 10059 44. 3. Tang H, Zhang X, Miao C, Zhang J, Ming R, Schnable JC, Schnable PS, 13. Dondi R, Sikora F. The longest run subsequence problem: Further com- Lyons E, Lu J. ALLMAPS: robust scaffold ordering based on multiple maps. plexity results. arXiV 2020. arXiv: 2011. 08119. Genome Biol. 2015; 16(1), 3. https:// doi. org/ 10. 1186/ s13059- 014- 0573-1. 4. Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. Direct determina- tion of diploid genome sequences. Genome Res. 2017; 27(5), 757–767. Publisher’s Note https:// doi. org/ 10. 1101/ gr. 214874. 116. Springer Nature remains neutral with regard to jurisdictional claims in pub- 5. Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO, Shendure J. lished maps and institutional affiliations. Chromosome-scale scaffolding of de novo genome assemblies based on Re Read ady y to to submit y submit your our re researc search h ? Choose BMC and benefit fr ? Choose BMC and benefit from om: : fast, convenient online submission thorough peer review by experienced researchers in your field rapid publication on acceptance support for research data, including large and complex data types • gold Open Access which fosters wider collaboration and increased citations maximum visibility for your research: over 100M website views per year At BMC, research is always in progress. Learn more biomedcentral.com/submissions
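Illustrative sketch of the dynamic program of Eqs. (8) and (9): the following Python function is a minimal reading of the recursion, not the authors' implementation. The function name `longest_run_subsequence`, the bitmask encoding of the sub-alphabet F, and the sparse dictionaries (a missing key stands for −∞) are assumptions made for this example. It returns only the optimal length; the backtracking step that recovers the subsequence is omitted.

```python
from itertools import groupby


def longest_run_subsequence(s):
    """Length of a longest run subsequence of s (hypothetical helper)."""
    # Compress s into runs; index 0 stands for the empty prefix.
    chars, lens = [None], [0]
    for ch, grp in groupby(s):
        chars.append(ch)
        lens.append(len(list(grp)))
    n = len(chars) - 1
    # Assumed encoding: one bit per character of the alphabet.
    bit = {ch: 1 << k for k, ch in enumerate(sorted(set(s)))}

    # D[i] maps a bitmask F to the best length of a solution ending in run i.
    # Missing keys play the role of -infinity; D[0] holds only F = empty set.
    D = [{0: 0}] + [dict() for _ in range(n)]
    last = {}  # last[ch] = index of the most recent run of character ch
    best = 0
    for i in range(1, n + 1):
        c, b = chars[i], bit[chars[i]]
        p = last.get(c, 0)  # last run with the same character, or 0
        # Predecessor set R_S(i): last occurrences at or after p, plus the
        # start vertex 0 when run i is the first run of its character.
        preds = {q for q in last.values() if q >= p}
        if p == 0:
            preds.add(0)
        for j in preds:
            for F, val in D[j].items():
                if chars[j] == c:
                    new_f = F          # same character: F stays unchanged
                elif F & b:
                    continue           # character was already used earlier
                else:
                    new_f = F | b      # a new character enters F
                cand = val + lens[i]
                if cand > D[i].get(new_f, -1):
                    D[i][new_f] = cand
                    best = max(best, cand)
        last[c] = i
    return best
```

For the translocation example from the Discussion, with `a` and `b` standing in for b_1 and b_2, `longest_run_subsequence("aaabbbaaa")` evaluates to 6, reflecting that one of the three runs has to be dropped. Storing only reachable sub-alphabets keeps the table small on easy instances, but the worst case still grows exponentially with the alphabet size.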
Algorithms for Molecular Biology – Springer Journals
Published: Jun 28, 2021
Keywords: Alignment; Assembly; String algorithm; Longest subsequence