A class of phylogenetic networks reconstructable from ancestral profiles

Peter L. Erdős, Charles Semple, and Mike Steel

Abstract. Rooted phylogenetic networks provide an explicit representation of the evolutionary history of a set $X$ of sampled species. In contrast to phylogenetic trees, which show only speciation events, networks can also accommodate reticulate processes (for example, hybrid evolution, endosymbiosis, and lateral gene transfer). A major goal in systematic biology is to infer evolutionary relationships, and while phylogenetic trees can be uniquely determined from various simple combinatorial data on $X$, for networks the reconstruction question is much more subtle. Here we ask when a network can be uniquely reconstructed from its 'ancestral profile' (the number of paths from each ancestral vertex to each element in $X$). We show that reconstruction holds (even within the class of all networks) for a class of networks we call 'orchard networks', and we provide a polynomial-time algorithm for reconstructing any orchard network from its ancestral profile. Our approach relies on establishing a structural theorem for orchard networks, which also provides a fast (polynomial-time) algorithm to test whether any given network is of orchard type. Since the class of orchard networks includes tree-sibling time-consistent networks and tree-child networks, our result generalises reconstruction results from 2008 and 2009. Orchard networks allow for an unbounded number $k$ of reticulation vertices, in contrast to tree-sibling time-consistent networks and tree-child networks, for which $k$ is at most $2|X| - 4$ and $|X| - 1$, respectively.

1. Introduction

Phylogenetic trees and networks have become a ubiquitous tool for representing evolutionary relationships in systematic biology [7] and other areas of classification (for example, language evolution and epidemiology).
From the early sketches by Charles Darwin and Ernst Haeckel in the 19th century, more complex and detailed trees are now revealing the finer details of portions of the 'tree of life'. Today, biologists routinely build phylogenetic trees on hundreds of species, such as the recent tree of (nearly) all 10,000 species of birds [14]. Phylogenetic trees have a leaf set $X$ that consists of the sampled organisms (typically, a group of present-day species); the root of the tree represents the most recent common ancestor of the species in $X$. Current methods for inferring phylogenetic trees generally use genomic data from the species in $X$, and apply one of several possible reconstruction methods. While many of these methods are statistically based, they are ultimately founded on underlying combinatorial uniqueness results concerning trees [7, 17].

Date: May 2, 2019.
1991 Mathematics Subject Classification. 05C85, 92D15.
Key words and phrases. Tree-child networks, orchard networks, accumulation phylogenies, ancestral profiles, path-tuples.
The first author was supported in part by the National Research, Development and Innovation Office (NKFIH grants K 116769 and KH 126853). The second and third authors were supported by the New Zealand Marsden Fund (UOC1709).
arXiv:1901.04064v2 [math.CO] 1 May 2019

Although phylogenetic trees have proved a convenient representation for many groups of species including, for example, mammals and birds, in other domains of life evolution is not always described as a simple vertical process of speciation (where lineages split in two as new species form) and extinction. Instead, various reticulate processes allow for a 'horizontal' component. Two main examples are the formation of hybrid species (such as in certain plant or fish species), and the exchange of genes between species in a process called lateral gene transfer (such as in bacteria).
An additional reticulate process relevant to early life on earth is endosymbiosis, in which organelles are incorporated into cells.

For these reasons, phylogenetic networks (acyclic directed graphs with a single root vertex and leaves forming the set $X$) have been proposed as a more flexible and accurate representation of evolutionary history [6, 15]. Accordingly, there has been considerable recent interest in extending the mathematical foundation of phylogenetic tree reconstruction to networks [11]. This extension faces a number of mathematical obstacles. In particular, while trees can be encoded and reconstructed in several ways (for example, based on their associated system of clusters, path distances between pairs of leaves, and induced 3-leaf subtrees), none of these approaches extends to networks, except in very special cases [9, 12, 19]. This has led to various approaches being proposed, which usually involve one or more of the following:

(i) not distinguishing between phylogenetic networks that are similar in a certain way [16];
(ii) considering reconstruction only within a limited subclass of phylogenetic networks [2]; and
(iii) allowing types of information for $X$ beyond what is normally used for tree reconstruction [1].

Approach (ii) has received the most attention so far, with some positive results (for example, for reconstructing the subclass of normal networks from their induced trees [20]). In this paper, we focus more on approach (iii), and, although we restrict to a subclass of networks (which we call 'orchard networks'), our reconstruction result has the additional strength that it can distinguish between any two networks from information on $X$ provided at least one of them is an orchard network.
To provide some intuition, informally, a phylogenetic network is an orchard network if it can be reduced to a single vertex by recursively finding a pair of leaves that form either a cherry or a reticulated cherry, and then applying a cherry reduction to that pair of leaves.

The type of information on $X$ we consider is the following. View the interior (non-leaf) vertices of a phylogenetic network $N$ as being labelled. In the biological setting, this label could correspond, for example, to the genome of the ancestral species at this vertex (or some sub-genome that is sufficiently detailed to distinguish this ancestral vertex from others). For each species $x$ in the leaf set $X$, suppose we can count the number of directed paths in the network from each ancestral genome (i.e. interior vertex) to $x$. This 'ancestral profile' is thus an ordered tuple of numbers, one tuple for each leaf in $X$ (note that current technology does not yet provide this information, so our approach is in the spirit of earlier mathematical results in phylogenetics that preceded the data required for their application).

It turns out that such information is not enough to distinguish between an arbitrary pair of networks (we provide an example). However, if the underlying network $N$ is an orchard network, our main result shows that no other network (orchard or not) can have the same ancestral profile. Moreover, we present and justify a polynomial-time algorithm for reconstructing any orchard network from its ancestral profile. Our arguments rely on a structural property of orchard networks which also implies that there is a polynomial-time algorithm for testing whether or not an arbitrary network is an orchard network.

Our results generalise earlier work in [4, 5], which considered the more restricted classes of 'tree-sibling time-consistent' networks and 'tree-child' networks, respectively.
These authors use equivalent information on $X$ for reconstruction; however, their reconstruction results face two limitations that are lifted here. First, the uniqueness results of [4, 5] hold only within the class of tree-sibling time-consistent networks and the class of tree-child networks, whereas we show that ancestral profiles can distinguish an orchard network from any other network. Second, neither tree-sibling time-consistent networks nor tree-child networks can have too many reticulate vertices (at most $2n - 4$ and $n - 1$, respectively, where $n = |X|$), whereas orchard networks can have arbitrarily many reticulate vertices (independent of $n$).

Our results are also related to (and partly motivated by) earlier work by [1] and [18] on 'accumulation phylogenies'. This involved a different subclass of networks (called 'regular' in these papers, and 'cluster networks' in [11]), which neither contains, nor is contained in, the subclass of orchard networks. A limitation of this subclass is that (unlike orchard networks) its networks do not allow 'redundant arcs' (an arc $(u, v)$ for which there is another path in the network from $u$ to $v$). Allowing redundant arcs has a strong biological motivation since, even if each reticulation event happens instantaneously between two contemporaneous species, redundant arcs can still appear in the resulting network if not all species at the present are sampled. The results in [1, 18] also assume any two networks being considered are within this same subclass. In summary, our results are not directly related to this earlier work on accumulation phylogenies, apart from using a related type of information.

The paper is organised as follows. The next section contains some necessary definitions along with the statement of the main result (Theorem 2.2) and deduces, as a consequence, the main result (Theorem 1) in [5]. This section also provides examples to justify various claims.
Section 3 describes some preliminary lemmas, which apply more generally than for ancestral profiles, and in Section 4 we state and prove the structural property of orchard networks that allows for an easy test as to whether or not an arbitrary network is of orchard type. The proof of Theorem 2.2 is established in Section 5. We end the paper with a brief discussion in Section 6.

Lastly, just as we completed the write-up of this paper, a manuscript [13] was posted on arXiv that also considers the class of orchard networks (referred to as "cherry-picking networks" in [13]). The focus of that manuscript is quite different to that of this paper; nevertheless, it contains an independent and different proof of the structural property of orchard networks which is needed as a lemma for Theorem 2.2 in this paper.

2. Main Result

Throughout the paper $X$ denotes a non-empty finite set and, unless otherwise stated, all paths are directed. For vertices $u$ and $v$ of a directed graph $D$, we say $v$ is reachable from $u$ if there is a path in $D$ from $u$ to $v$. Furthermore, for sets $A$ and $B$, we denote the set obtained from $A$ by removing every element in $A$ that is also in $B$ by $A - B$. If $|B| = 1$, say $B = \{b\}$, we denote this by $A - b$.

Phylogenetic networks. A phylogenetic network on $X$ is a rooted acyclic directed graph with no arcs in parallel and satisfying the following properties:

(i) the (unique) root has in-degree zero and out-degree two;
(ii) a vertex with out-degree zero has in-degree one, and the set of vertices with out-degree zero is $X$; and
(iii) all other vertices either have in-degree one and out-degree two, or in-degree two and out-degree one.

Figure 1. A phylogenetic network $N$ on $\{x_1, x_2, \ldots, x_6\}$. Here, $\{x_1, x_2\}$ is a cherry and $\{x_3, x_4\}$ is a reticulated cherry with $x_4$ the reticulation leaf.
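The degree conditions (i)-(iii), together with acyclicity and the absence of parallel arcs, can be checked mechanically. The following is a minimal sketch, assuming a network is encoded as a list of (parent, child) arcs plus a leaf set; the function name `is_phylogenetic_network` and this encoding are illustrative choices, not notation from the paper.

```python
from collections import defaultdict

def is_phylogenetic_network(arcs, leaves):
    """Check conditions (i)-(iii), no parallel arcs, and acyclicity.

    arcs   : list of (parent, child) pairs (assumed encoding)
    leaves : the leaf set X
    """
    arcs = list(arcs)
    leaves = set(leaves)
    if not arcs:
        # Special case: a single vertex is allowed when |X| = 1.
        return len(leaves) == 1
    if len(set(arcs)) != len(arcs):
        return False  # arcs in parallel are not allowed

    indeg, outdeg = defaultdict(int), defaultdict(int)
    children = defaultdict(list)
    for u, v in arcs:
        outdeg[u] += 1
        indeg[v] += 1
        children[u].append(v)
    vertices = set(indeg) | set(outdeg)

    # (i) a unique root with in-degree zero and out-degree two
    roots = [v for v in vertices if indeg[v] == 0]
    if len(roots) != 1 or outdeg[roots[0]] != 2:
        return False
    root = roots[0]

    # (ii) the vertices of out-degree zero are exactly X, each of in-degree one
    if {v for v in vertices if outdeg[v] == 0} != leaves:
        return False
    for v in vertices - {root}:
        if v in leaves:
            ok = indeg[v] == 1
        else:
            # (iii) tree vertex (1, 2) or reticulation (2, 1)
            ok = (indeg[v], outdeg[v]) in {(1, 2), (2, 1)}
        if not ok:
            return False

    # Acyclicity: a topological sweep from the root must reach every vertex.
    remaining = {v: indeg[v] for v in vertices}
    stack, seen = [root], 0
    while stack:
        u = stack.pop()
        seen += 1
        for w in children[u]:
            remaining[w] -= 1
            if remaining[w] == 0:
                stack.append(w)
    return seen == len(vertices)
```

For instance, the arc list `[("r","u"),("r","v"),("u","a"),("u","h"),("v","h"),("v","b"),("h","c")]` with leaf set `{"a","b","c"}` (a hypothetical three-leaf network with one reticulation `h`) passes the check.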
For technical reasons, if $|X| = 1$, we additionally allow a single vertex to be a phylogenetic network, in which case, the root is the vertex in $X$. Phylogenetic networks as defined here are also referred to as 'binary phylogenetic networks' in the literature.

Let $N$ be a phylogenetic network on $X$. The vertices with out-degree zero are the leaves of $N$, and so $X$ is called the leaf set of $N$. Furthermore, vertices with in-degree one and out-degree two are tree vertices, while vertices of in-degree two and out-degree one are reticulations. The arcs directed into a reticulation are called reticulation arcs; all other arcs are tree arcs. To illustrate, an example of a phylogenetic network with leaf set $\{x_1, x_2, \ldots, x_6\}$ and three reticulations is shown in Fig. 1.

Lastly, let $N_1$ and $N_2$ be two phylogenetic networks on $X$ with vertex and arc sets $V_1$ and $E_1$, and $V_2$ and $E_2$, respectively. We say $N_1$ is isomorphic to $N_2$ if there exists a bijection $\varphi : V_1 \to V_2$ such that $\varphi(x) = x$ for all $x \in X$, and $(u, v) \in E_1$ if and only if $(\varphi(u), \varphi(v)) \in E_2$ for all $u, v \in V_1$.

Figure 2. (i) $N_1$ has been obtained from $N$ in Fig. 1 by reducing $x_2$, while (ii) $N_2$ has been obtained from $N$ by cutting $\{x_3, x_4\}$.

Ancestral tuples and ancestral profile. Let $N$ be a phylogenetic network on $X$ with vertex set $V$. Let $v_1, v_2, \ldots, v_t$ be a fixed (arbitrary) labelling of the vertices in $V - X$. For all $x \in X$, the ancestral tuple of $x$, denoted $\sigma(x)$, is the $t$-tuple whose $i$-th entry is the number of paths in $N$ from $v_i$ to $x$. Denoted by $\Sigma_N$, we call the set
$$\Sigma_N = \{(x, \sigma(x)) : x \in X\}$$
of ordered pairs the ancestral profile of $N$. Furthermore, if $N'$ is a phylogenetic network on $X$ and, up to an ordering of the non-leaf vertices of $N'$, we have $\Sigma_{N'} = \Sigma_N$, we say $N'$ realises $\Sigma_N$.
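To make the definition of the ancestral profile concrete, the path counts can be computed by a memoised recursion over children. This is a minimal sketch under the assumption that the network is given as a list of (parent, child) arcs; the function name `ancestral_profile` and the dictionary output format are hypothetical conveniences.

```python
from collections import defaultdict

def ancestral_profile(arcs, leaves, order):
    """Return {x: sigma(x)} for every leaf x.

    arcs   : list of (parent, child) pairs (assumed encoding)
    leaves : the leaf set X
    order  : the fixed ordering v_1, ..., v_t of the non-leaf vertices
    """
    children = defaultdict(list)
    for u, v in arcs:
        children[u].append(v)

    def sigma(x):
        # memo[v] = number of directed paths from v to x
        memo = {x: 1}  # the trivial path from x to itself

        def count(v):
            if v not in memo:
                # Paths from v to x split according to the first arc taken.
                memo[v] = sum(count(w) for w in children[v])
            return memo[v]

        return tuple(count(v) for v in order)

    return {x: sigma(x) for x in leaves}


# A hypothetical network with one reticulation h (parents u and v).
net = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
       ("v", "h"), ("v", "b"), ("h", "c")]
profile = ancestral_profile(net, ["a", "b", "c"], ["r", "u", "v", "h"])
print(profile["c"])  # (2, 1, 1, 1): two root-to-c paths, one via u and one via v
```

The entry 2 in the tuple of `c` is what distinguishes an ancestral tuple from a mere record of which vertices are ancestors of `c`.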
Lastly, although $\Sigma_N$ depends on the ordering of the vertices in $V - X$, the ordering is fixed and so the labelling can be effectively ignored.

Cherries and reticulated cherries. Let $N$ be a phylogenetic network on $X$, and let $\{a, b\}$ be a 2-element subset of $X$. Let $p_a$ and $p_b$ denote the parents of $a$ and $b$, respectively. We say $\{a, b\}$ is a cherry of $N$ if $p_a = p_b$. Furthermore, if one of the parents, say $p_b$, is a reticulation and $(p_a, p_b)$ is an arc in $N$, then $\{a, b\}$ is a reticulated cherry of $N$, in which case, $b$ is the reticulation leaf of the reticulated cherry. Observe that $p_a$ is necessarily a tree vertex. For the phylogenetic network shown in Fig. 1, $\{x_1, x_2\}$ is a cherry, while $\{x_3, x_4\}$ is a reticulated cherry in which $x_4$ is the reticulation leaf. Furthermore, in Fig. 1, $\{x_4, x_5\}$ is neither a cherry nor a reticulated cherry.

We next describe two operations associated with cherries and reticulated cherries that are central to this paper. Let $N$ be a phylogenetic network. First suppose that $\{a, b\}$ is a cherry of $N$. Then reducing $b$ is the operation of deleting $b$ and suppressing the resulting vertex of in-degree one and out-degree one. If the parent of $a$ and of $b$ is the root of $N$, then reducing $b$ is the operation of deleting $b$ as well as deleting the root of $N$, thus leaving only the isolated vertex $a$. Now suppose that $\{a, b\}$ is a reticulated cherry of $N$ in which $b$ is the reticulation leaf. Then cutting $\{a, b\}$ is the operation of deleting the reticulation arc joining the parents of $a$ and $b$, and suppressing the two resulting vertices of in-degree one and out-degree one. It is easily seen that the operations of reducing a cherry and cutting a reticulated cherry both result in a phylogenetic network. Collectively, we refer to these two operations as cherry reductions. To illustrate, the phylogenetic network shown in Fig. 2(i) (resp. Fig. 2(ii)) has been obtained from the phylogenetic network in Fig. 1 by reducing $x_2$ (resp. cutting $\{x_3, x_4\}$).
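The two definitions translate directly into a test on the parents of a pair of leaves. The sketch below assumes the same (parent, child) arc-list encoding as an illustration; the function name `classify_pair` and its return convention are hypothetical.

```python
from collections import defaultdict

def classify_pair(arcs, a, b):
    """Classify the leaf pair {a, b} of a phylogenetic network.

    Returns ("cherry", None), ("reticulated cherry", leaf) where `leaf`
    is the reticulation leaf, or None if the pair is neither.
    """
    parents = defaultdict(list)
    for u, v in arcs:
        parents[v].append(u)

    (pa,), (pb,) = parents[a], parents[b]  # every leaf has in-degree one
    if pa == pb:
        return ("cherry", None)
    # p_b is a reticulation and (p_a, p_b) is an arc: b is the reticulation leaf.
    if len(parents[pb]) == 2 and pa in parents[pb]:
        return ("reticulated cherry", b)
    # Symmetric case with the roles of a and b exchanged.
    if len(parents[pa]) == 2 and pb in parents[pa]:
        return ("reticulated cherry", a)
    return None
```

On the hypothetical network with arcs `r→u, r→v, u→a, u→h, v→h, v→b, h→c`, the pair `{a, c}` is reported as a reticulated cherry with reticulation leaf `c`, while `{a, b}` is neither kind of cherry.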
Figure 3. (i) An orchard network and (ii) a non-orchard network.

Orchard networks. For a phylogenetic network $N$, the sequence
$$N = N_0, N_1, N_2, \ldots, N_k \tag{1}$$
of phylogenetic networks is a cherry-reduction sequence of $N$ if, for all $i \in \{1, 2, \ldots, k\}$, the phylogenetic network $N_i$ is obtained from $N_{i-1}$ by a (single) cherry reduction. The sequence is maximal if $N_k$ has no cherries or reticulated cherries. If $N_k$ consists of a single vertex, the sequence is complete, in which case, $N$ is called an orchard network. Observe that if (1) is complete, then the leaf set of $N_{k-1}$ has size two and the parent of each leaf is the root of $N_{k-1}$. It is easily checked that the phylogenetic network shown in Fig. 1 is an orchard network.

In Section 4, we show that if $N$ is an orchard network, then every maximal sequence of cherry reductions of $N$ is complete. Thus if we want to construct a complete cherry-reduction sequence for an orchard network, the order in which the reductions are applied does not matter. In turn, this provides an easy test to decide whether or not an arbitrary network is orchard.

One of the most well-studied classes of phylogenetic networks is the class of tree-child networks. Introduced in [5], a phylogenetic network is tree-child if every non-leaf vertex is the parent of a tree vertex or a leaf. Tree-child networks are examples of orchard networks [3], but there exist orchard networks that are not tree-child. Indeed, while the size of the leaf set bounds the total number of vertices of a tree-child network [5], the total number of vertices in an orchard network is not necessarily bounded by the size of its leaf set. For example, the phylogenetic network shown in Fig. 3(i) is an orchard network with exactly three leaves but, by extending it in the obvious way, we can produce an orchard network with an arbitrarily large odd number of vertices and still with exactly three leaves.
Furthermore, not all phylogenetic networks are orchard networks, as Fig. 3(ii) illustrates.

For this paper, a second relevant class of phylogenetic networks is the class of tree-sibling time-consistent networks. Let $N$ be a phylogenetic network. We say $N$ is tree-sibling if every reticulation has a parent that is also the parent of a tree vertex or a leaf. Furthermore, $N$ is time-consistent if there is a map $t$ from the vertex set of $N$ to the non-negative integers such that if $(u, v)$ is a reticulation arc of $N$, then $t(u) = t(v)$; otherwise, $t(u) < t(v)$. We refer to such a mapping as a temporal labelling. In the literature, time-consistent networks are also referred to as temporal networks. Like tree-child networks, the class of tree-sibling time-consistent networks is a proper subclass of orchard networks. For completeness, we include a proof of containment. The containment is proper because, as shown in [4], the number of reticulations of a tree-sibling time-consistent network is bounded by the size of its leaf set, which is not the case for orchard networks.

Lemma 2.1. Let $N$ be a tree-sibling time-consistent network. Then $N$ is an orchard network.

Proof. Clearly, the lemma holds if $N$ has no reticulations. Therefore we may assume that $N$ has at least one reticulation. We first show that $N$ has either a cherry or a reticulated cherry. Let $t$ be a temporal labelling of the vertices of $N$, and let $v$ be a reticulation with the property that $t(v) \ge t(v')$ for all reticulations $v'$ of $N$. Since $N$ is tree-sibling, $v$ has a parent, $u$ say, that is the parent of a vertex $w$ which is either a tree vertex or a leaf. By maximality, no reticulations (other than $v$ itself) are reachable from $v$ or $w$. Therefore, if two leaves are reachable from either $v$ or $w$, then $N$ has a cherry. If this does not occur, then $w$ is a leaf and the (unique) child, $x$ say, of $v$ is also a leaf. In particular, $\{w, x\}$ is a reticulated cherry of $N$.
To complete the proof, let $N'$ be obtained from $N$ by a cherry reduction. Clearly, $N'$ is also tree-sibling. Furthermore, it is easily checked that the mapping $t'$ from the vertex set of $N'$ to the non-negative integers given by $t'(u) = t(u)$ is a temporal labelling of $N'$. Thus $N'$ is tree-sibling time-consistent. The lemma now follows.

Main result. The following theorem is the main result of the paper.

Theorem 2.2. Let $N$ be an orchard network on $X$ with vertex set $V$. Then, up to isomorphism, $N$ is the unique phylogenetic network on $X$ realising $\Sigma_N$. Furthermore, up to isomorphism, $N$ can be reconstructed from $\Sigma_N$ in time $O(|X|^3 |V|^3)$.

It is worth emphasising that the uniqueness of $N$ in the statement of Theorem 2.2 is amongst all phylogenetic networks on $X$, not just within the class of orchard networks on $X$. Furthermore, if $N$ is not an orchard network, then the conclusion of Theorem 2.2 does not necessarily hold. In particular, consider the two phylogenetic networks $N_1$ and $N_2$ in Fig. 4. It is easily checked that by fixing an ordering of the non-leaf vertices of each of $N_1$ and $N_2$ so that the parent of $y$ is in the same position in both orderings, we have $\Sigma_{N_1} = \Sigma_{N_2}$. But $N_1$ is not isomorphic to $N_2$.

Figure 4. Two non-isomorphic phylogenetic networks $N_1$ and $N_2$ with $\Sigma_{N_1} = \Sigma_{N_2}$.

Theorem 2.2 generalises results of Cardona et al. [4] and Cardona et al. [5]. Let $N$ be a phylogenetic network on $X$ with vertex set $V$ and let $x_1, x_2, \ldots, x_n$ be a fixed ordering of the leaves in $X$. For all $v \in V - X$, the path tuple of $v$, denoted $\pi(v)$, is the $n$-tuple whose $i$-th entry is the number of paths in $N$ from $v$ to $x_i$. Let $\Pi_N$ denote the multiset
$$\{\pi(v) : v \in V - X\}$$
of path tuples of $N$. If $N'$ is a phylogenetic network on $X$ and, up to an ordering of $X$, we have $\Pi_{N'} = \Pi_N$, we say $N'$ realises $\Pi_N$. The next theorem was established in [4] and [5].

Theorem 2.3. Let $N$ be a phylogenetic network on $X$.
(i) If $N$ is tree-sibling time-consistent, then, up to isomorphism, $N$ is the unique tree-sibling time-consistent network on $X$ realising $\Pi_N$.
(ii) If $N$ is tree-child, then, up to isomorphism, $N$ is the unique tree-child network on $X$ realising $\Pi_N$.

Furthermore, for both instances, up to isomorphism, $N$ can be constructed from $\Pi_N$ in time polynomial in the size of $X$.

Let $N$ be a phylogenetic network on $X$ with vertex set $V$. The set $\Sigma_N$ and multiset $\Pi_N$ are equivalent in the amount of information they provide. To see this, let $x_1, x_2, \ldots, x_n$ and $v_1, v_2, \ldots, v_t$ be fixed orderings of the vertices in $X$ and $V - X$, respectively. Then, for all $i \in \{1, 2, \ldots, t\}$, the $n$-tuple $\pi(v_i)$ is the tuple whose $j$-th entry is the $i$-th entry of $\sigma(x_j)$ for all $j \in \{1, 2, \ldots, n\}$. Similarly, each ordered pair in $\Sigma_N$ can be obtained from $\Pi_N$. Thus Theorem 2.2 generalises Theorem 2.3 in two ways. First, it shows that the latter holds for the more general class of orchard networks and, second, the uniqueness is not confined to the class of networks being constructed.

Figure 5. Two orchard networks $N_1$ and $N_2$ with $\Gamma_{N_1} = \Gamma_{N_2}$, but $\Sigma_{N_1} \ne \Sigma_{N_2}$.

We end the section with three remarks. Firstly, Theorem 2.2 is not the first reconstruction result concerning the class of orchard networks. Although this class was not named, it is shown in [3] that orchard networks are reconstructible from their so-called multiset distance matrices. See [3, Theorem 3.4]. We have no doubt that, over time, the class of orchard networks will be shown to be reconstructible in other ways as well.

The second remark concerns a related, but weaker, notion to that of ancestral tuples called ancestral sets. Let $N$ be a phylogenetic network on $X$ with vertex set $V$.
For all $x \in X$, the ancestral set of $x$ is
$$\gamma(x) = \{v \in V - X : x \text{ is reachable from } v\}.$$
Thus $\gamma(x)$ is the set of non-leaf vertices $v$ in $N$ for which there is a directed path from $v$ to $x$. Observe that, for all $x \in X$, the root of $N$ is always an element of $\gamma(x)$ and so $\gamma(x)$ is non-empty. Let $\Gamma_N$ denote the set
$$\{(x, \gamma(x)) : x \in X\}$$
of ordered pairs. Given $\Sigma_N$, it is clear that we can construct $\Gamma_N$ in time $O(|V|)$.

To see that ancestral sets is a weaker notion than ancestral tuples, consider the two orchard networks $N_1$ and $N_2$ shown in Fig. 5, where the non-leaf vertices have been labelled $1, 2, \ldots, 8$. For each $i \in \{1, 2\}$, the ancestral sets of $x_1$, $x_2$, and $x_3$ are $\{1, 2, 3, 4, 5, 7\}$, $\{1, 2, \ldots, 8\}$, and $\{1, 2, 3\}$, respectively. But $N_1$ is not isomorphic to $N_2$. Note that, for a fixed ordering of $1, 2, \ldots, 8$, the ancestral tuple of $x_2$ differs in $N_1$ and $N_2$ even though the ancestral tuples of $x_1$ and $x_3$ are the same for $N_1$ and $N_2$. Nevertheless, despite this example, the ancestral sets of a phylogenetic network $N$ do provide some information regarding the structure of $N$. As this is of possible independent interest, we highlight this in the next section where the preliminary lemmas are established in terms of ancestral sets.

The third remark concerns the relationship between orchard networks and the increasingly prominent class of tree-based networks [8]. A phylogenetic network $N$ on $X$ with root $\rho$ and vertex set $V$ is tree-based if it has, as a subgraph, a rooted subtree with root $\rho$, vertex set $V$, and leaf set $X$. Note that $\rho$ in the subtree may have out-degree one. It is shown in [10] that the class of orchard networks is a proper subclass of tree-based networks. To see that it is proper, observe that the non-orchard networks $N_1$ and $N_2$ in Fig. 4 are both tree-based. Thus, the networks in this figure also show that Theorem 2.2 does not extend to tree-based networks.

3. Preliminary Lemmas

In this section, we establish several results that will be used in the proof of Theorem 2.2. These results show that the ancestral sets, and thus the ancestral tuples, of an arbitrary phylogenetic network recognise and distinguish cherries and reticulated cherries.

Lemma 3.1. Let $N$ be a phylogenetic network on $X$, and let $a$ and $b$ be distinct elements in $X$. Then $\gamma(a) \subseteq \gamma(b)$ if and only if the parent of $b$ is reachable from the parent of $a$.

Proof. Let $p_a$ and $p_b$ denote the parents of $a$ and $b$, respectively. If $p_b$ is reachable from $p_a$, then it is clear that $\gamma(a) \subseteq \gamma(b)$. To prove the converse, suppose that $\gamma(a) \subseteq \gamma(b)$. Then $p_a \in \gamma(b)$ and so, by definition, $b$ is reachable from $p_a$. In turn, this implies that $p_b$ is reachable from $p_a$.

The next corollary immediately follows from Lemma 3.1 and the fact that phylogenetic networks are acyclic.

Corollary 3.2. Let $N$ be a phylogenetic network on $X$, and let $\{a, b\}$ be a 2-element subset of $X$. Then $\{a, b\}$ is a cherry in $N$ if and only if $\gamma(a) = \gamma(b)$.

Lemma 3.3. Let $N$ be a phylogenetic network on $X$, and let $\{a, b\}$ be a 2-element subset of $X$. Then $\{a, b\}$ is a reticulated cherry of $N$ in which $b$ is the reticulation leaf if and only if

(i) $\gamma(a) \subsetneq \gamma(b)$,
(ii) there is no $x \in X - b$ such that $\gamma(a) \subsetneq \gamma(x)$, and
(iii) $\left| \gamma(b) - \bigcup_{x \in X - b} \gamma(x) \right| = 1$.

Proof. Let $p_a$ and $p_b$ denote the parents of $a$ and $b$, respectively. It is easily checked that if $\{a, b\}$ is a reticulated cherry in which $b$ is the reticulation leaf, then (i)-(iii) hold. So suppose that (i)-(iii) hold. Since (i) holds, it follows by Lemma 3.1 that there is a directed path $P$ in $N$ from $p_a$ to $p_b$. If $p_b$ is a tree vertex, then $N$ has a leaf, $c$ say, reachable from $p_b$ such that $c \ne b$. This implies that $\gamma(a) \subsetneq \gamma(c)$, contradicting (ii). Therefore $p_b$ is a reticulation. Lastly, assume $(p_a, p_b)$ is not an arc in $N$. Let $u$ denote the vertex on $P$ immediately prior to $p_b$.
If $u$ is a tree vertex, then $N$ has a leaf $c' \ne b$ reachable from $u$ with $\gamma(a) \subsetneq \gamma(c')$, contradicting (ii). On the other hand, if $u$ is a reticulation, then
$$\left| \gamma(b) - \bigcup_{x \in X - b} \gamma(x) \right| \ge 2,$$
contradicting (iii). Thus $(p_a, p_b)$ is an arc and so $\{a, b\}$ is a reticulated cherry in which $b$ is the reticulation leaf.

4. Order Does Not Matter

Let $N$ be an orchard network. Then, by definition, there exists a complete cherry-reduction sequence for $N$. But how do we find such a sequence, and does the order in which we apply the cherry reductions matter? The next proposition says that if we take $N$ and repeatedly apply cherry reductions until no more are possible, we always construct a complete cherry-reduction sequence.

A vertex on a directed path is non-terminal if it is neither the first nor last vertex on the path.

Proposition 4.1. Let $N$ be an orchard network, and let
$$N = N_0, N_1, N_2, \ldots, N_\ell \tag{2}$$
be a maximal sequence of cherry reductions. Then this sequence is complete.

Proof. Let $X$ denote the leaf set of $N$, and suppose (2) is not complete. Paralleling (2), we begin by constructing a sequence
$$N = M_0, M_1, M_2, \ldots, M_\ell$$
of rooted acyclic directed graphs as follows. If $N_1$ is obtained from $N_0$ by reducing a leaf of a cherry, then $M_1$ is obtained from $M_0$ by deleting the same leaf but not suppressing the resulting vertex of in-degree one and out-degree one. Similarly, if $N_1$ is obtained from $N_0$ by cutting a reticulated cherry, then $M_1$ is obtained from $M_0$ by deleting the same reticulation arc but not suppressing the two resulting vertices of in-degree one and out-degree one. More generally, if $N_i$ is obtained from $N_{i-1}$ by reducing a leaf of a cherry, that is, deleting a leaf $b$ say and suppressing its parent $p_b$, then $M_i$ is obtained from $M_{i-1}$ by deleting $b$ as well as deleting every non-terminal vertex on the (unique) path from $p_b$ to $b$ in $M_{i-1}$. Note that each of these non-terminal vertices has in-degree one and out-degree one in $M_{i-1}$.
On the other hand, if $N_i$ is obtained from $N_{i-1}$ by cutting a reticulated cherry, that is, deleting a reticulation arc $(p_a, p_b)$ and suppressing $p_a$ and $p_b$, then $M_i$ is obtained from $M_{i-1}$ by deleting $(p_a, p_b)$. Observe that, for all $i$, if we suppress every vertex in $M_i$ of in-degree one and out-degree one, we obtain $N_i$. Thus $M_i$ is a subdivision of $N_i$ for all $i$, that is, $N_i$ can be obtained from $M_i$ by suppressing all vertices of in-degree one and out-degree one. Furthermore, as (2) is not complete, the root $\rho$ of $N$ is never deleted and so, for all $i$, the root of $M_i$ is also $\rho$ and has out-degree two in $M_i$.

We now analyse $M_\ell$. Since (2) is maximal and not complete, $N_\ell$ has at least one reticulation. This implies that $M_\ell$ has at least one vertex of in-degree two and out-degree one. We next show that every non-terminal vertex in $M_\ell$ on a path from $\rho$ to a vertex of in-degree two and out-degree one has degree three.

4.1.1. Let $v$ be a vertex of in-degree two and out-degree one in $M_\ell$. If $u$ is a non-terminal vertex of $M_\ell$ on a path in $M_\ell$ from $\rho$ to $v$, then $u$ has degree three in $M_\ell$.

Proof. Suppose $u$ is a vertex of in-degree one and out-degree one on a path from $\rho$ to $v$ in $M_\ell$. In $N$, the vertex $u$ has degree three. Therefore, for some $i \in \{1, 2, \ldots, \ell\}$, we have that $N_i$ is obtained from $N_{i-1}$ by a cherry reduction in which an arc incident with $u$ is deleted. Now, as $v$ is a vertex of in-degree two and out-degree one in $M_\ell$, it follows that $v$ is a reticulation in $N_\ell$, and therefore a reticulation in $N_i$. Thus there is a path $P$ in $N_i$ from $u$ to $v$. It is now easily checked that no cherry reduction applied to $N_{i-1}$ in which an arc incident with $u$ and not lying on $P$ is deleted is possible. Hence $u$ has degree three.

We now complete the proof of the proposition. Since $N$ is orchard, there is a sequence
$$N = N'_0, N'_1, N'_2, \ldots, N'_k$$
of cherry reductions such that $N'_k$ consists of a single vertex.
Let $i$ be the smallest index such that $N'_i$ is obtained from $N'_{i-1}$ by cutting a reticulated cherry in which the deleted reticulation arc, $(u, v)$ say, has the property that $v$ is in $M_\ell$ and it has in-degree two and out-degree one in $M_\ell$. Observe that, by the choice of $i$, no vertex of in-degree two and out-degree one is reachable from $v$ in $M_\ell$ except $v$ itself. As (2) is maximal, this implies that there is a unique vertex, $\ell_v$ say, in $X$ that is reachable from $v$ in $M_\ell$.

Now, $u$ is a tree vertex in $N'_{i-1}$ whose other child, in addition to $v$, is a leaf. By (4.1.1), $u$ has degree three in $M_\ell$. Furthermore, as $u$ is a tree vertex in $N'_{i-1}$, it follows that $u$ has in-degree one and out-degree two in $M_\ell$. Let $w$ denote the child of $u$ in $M_\ell$ that is not $v$. At least one vertex in $X$ is reachable from $w$ in $M_\ell$ and this vertex is not $\ell_v$. If, in $M_\ell$, there is no vertex reachable from $w$ with in-degree two and out-degree one, then (2) is not maximal. Therefore, in $M_\ell$ there is such a vertex $w'$ reachable from $w$. In $N$, the vertex $w'$ is a reticulation, and so there is a $j \in \{1, 2, \ldots, k\}$ such that $N'_j$ is obtained from $N'_{j-1}$ by cutting a reticulated cherry in which a reticulation arc directed into $w'$ is deleted. Since $(u, v)$ is the reticulation arc directed into $v$ that is deleted, it follows that $j < i$. But, by the choice of $i$, we have $i < j$; a contradiction. We conclude that (2) is complete.

The following corollary is an immediate consequence of Proposition 4.1.

Corollary 4.2. Let $N$ be an orchard network, and let $\{a, b\}$ be a cherry or a reticulated cherry of $N$. If $N'$ is obtained from $N$ by reducing $b$ if $\{a, b\}$ is a cherry or cutting $\{a, b\}$ if $\{a, b\}$ is a reticulated cherry, then $N'$ is an orchard network.
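Proposition 4.1 and Corollary 4.2 justify a greedy orchard test: apply any available cherry reduction until none remains, and check whether a single vertex is left. The following is a minimal sketch, assuming the input is a valid phylogenetic network given as a list of (parent, child) arcs; the function name `is_orchard` and the helper names are hypothetical, and no attempt is made to match the running time discussed in the paper.

```python
def is_orchard(arcs, leaves):
    """Greedy orchard test by repeated cherry reductions (a sketch)."""
    children, parents = {}, {}
    for u, v in arcs:
        for w in (u, v):
            children.setdefault(w, set())
            parents.setdefault(w, set())
        children[u].add(v)
        parents[v].add(u)
    live = set(leaves)
    if not children:
        return len(live) == 1  # single-vertex network

    def delete_vertex(v):
        for p in parents.pop(v):
            children[p].discard(v)
        for c in children.pop(v):
            parents[c].discard(v)

    def suppress(v):
        # v has exactly one child; join its parent (if any) to that child.
        (c,) = children[v]
        above = set(parents[v])  # empty when v is the root
        delete_vertex(v)
        for g in above:
            children[g].add(c)
            parents[c].add(g)

    def find_reduction():
        for b in live:
            if len(parents[b]) != 1:
                continue  # b is the only remaining vertex
            (pb,) = parents[b]
            if len(parents[pb]) < 2:  # p_b is a tree vertex or the root
                for a in children[pb] - {b}:
                    if a in live:
                        return "reduce", a, b  # {a, b} is a cherry
            else:  # p_b is a reticulation
                for pa in parents[pb]:
                    for a in children[pa] - {pb}:
                        if a in live:
                            return "cut", a, b  # b is the reticulation leaf
        return None

    while True:
        step = find_reduction()
        if step is None:
            break
        kind, a, b = step
        (pb,) = parents[b]
        if kind == "reduce":
            delete_vertex(b)
            live.discard(b)
            suppress(pb)
        else:
            (pa,) = parents[a]
            children[pa].discard(pb)  # delete the reticulation arc (p_a, p_b)
            parents[pb].discard(pa)
            suppress(pa)
            suppress(pb)
    return len(children) == 1
```

Because the order of reductions does not matter for orchard networks, the arbitrary iteration order over the leaf set does not affect the answer.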
Since deciding if a given pair of leaves of a phylogenetic network is either a cherry or a reticulated cherry takes constant time and a cherry reduction also takes constant time, the last corollary gives a polynomial-time algorithm for deciding if an arbitrary phylogenetic network $N$ is orchard. In particular, repeatedly find a cherry or a reticulated cherry, and apply the appropriate cherry reduction until this process is no longer possible. This takes at most $O(|V|)$ iterations, where $V$ is the vertex set of $N$. If, at the completion of this process, we have a phylogenetic network consisting of a single vertex, then $N$ is orchard; otherwise, $N$ is not orchard. Observe that if $N$ is orchard with $n$ leaves and $k$ reticulations, then this process consists of $n + k - 1$ cherry reductions.

5. Proof of Theorem 2.2

In this section, we prove Theorem 2.2. For a phylogenetic network $N$, Corollary 3.2 and Lemma 3.3 show that it is straightforward to recognise cherries and reticulated cherries of $N$ using only the ancestral sets, and thus the ancestral tuples, of $N$. This fact is freely used throughout this section. We next describe two operations on tuples that parallel the operations of reducing a cherry and cutting a reticulated cherry.

Let $X$ be a non-empty finite set and, for some fixed $t$, let $\Sigma = \{(x, \sigma(x)) : x \in X\}$ be a set of ordered pairs, where, for all $x \in X$, we have that $\sigma(x)$ is a $t$-tuple whose entries are either non-negative integers or $-$. Note that the symbol $-$ is going to be used as a placeholder. Let $\{a, b\}$ be a 2-element subset of $X$.

The first operation will be used only in association with reducing $b$ when $\{a, b\}$ is a cherry. Let $j \in \{1, 2, \ldots, t\}$ such that $\sigma_j(a) = \sigma_j(b) = 1$, but $\sigma_j(x) = 0$ for all $x \in X - \{a, b\}$. Let $\Sigma'$ be the set of $|X - b|$ ordered pairs obtained from $\Sigma$ as follows. For all $x \in X - b$, set $\sigma'(x)$ so that the $i$-th entry is
$$\sigma'_i(x) = \begin{cases} \sigma_i(x), & \text{if } i \neq j; \\ -, & \text{if } i = j. \end{cases}$$
Set $\Sigma' = \{(x, \sigma'(x)) : x \in X - b\}$. We say that $\Sigma'$ has been obtained from $\Sigma$ by reducing $b$.
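The operation of reducing $b$ on a set of tuples can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the profile is stored as a dictionary from leaf names to lists, `None` stands in for the placeholder symbol, indices are 0-based, and the function name `reduce_leaf` is hypothetical.

```python
PLACEHOLDER = None  # stands in for the placeholder symbol used in the text


def reduce_leaf(profile, a, b):
    """Return the profile obtained by reducing b, where {a, b} is a cherry.

    `profile` maps each leaf to its tuple (a list of non-negative integers
    or PLACEHOLDER). A coordinate j is chosen with entry 1 for both a and b
    and entry 0 for every other leaf; that coordinate is blanked out and the
    pair for b is dropped.
    """
    others = [x for x in profile if x not in (a, b)]
    for j in range(len(profile[a])):
        if (profile[a][j] == 1 and profile[b][j] == 1
                and all(profile[x][j] == 0 for x in others)):
            break
    else:
        raise ValueError("no coordinate satisfies the cherry condition")
    return {
        x: [PLACEHOLDER if i == j else entry for i, entry in enumerate(tup)]
        for x, tup in profile.items()
        if x != b
    }


# Hypothetical example: a tree with root v1 whose children are the leaf c
# and the vertex v2, where v2 is the parent of the cherry {a, b}.
# Coordinates are ordered (v1, v2).
example = {"a": [1, 1], "b": [1, 1], "c": [1, 0]}
reduced = reduce_leaf(example, "a", "b")
```

In the example, only the second coordinate (the parent of the cherry) qualifies, so it is replaced by the placeholder. Ignoring placeholder entries, what remains is the profile of the two-leaf tree on $\{a, c\}$.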
The second operation will be used only in association with cutting $\{a, b\}$ when $\{a, b\}$ is a reticulated cherry in which $b$ is the reticulation leaf. Let $j \in \{1, 2, \ldots, t\}$ such that $\sigma_j(a) = 1 = \sigma_j(b)$ but $\sigma_j(x) = 0$ for all $x \in X - \{a, b\}$, and let $k \in \{1, 2, \ldots, t\}$ such that $\sigma_k(b) = 1$ but $\sigma_k(x) = 0$ for all $x \in X - b$. Let $\Sigma'$ be the set of $|X|$ ordered pairs obtained from $\Sigma$ as follows. For all $x \in X - b$, set $\sigma'(x)$ so that the $i$-th entry is
$$\sigma'_i(x) = \begin{cases} \sigma_i(x), & \text{if } i \notin \{j, k\}; \\ -, & \text{if } i \in \{j, k\}; \end{cases}$$
and set $\sigma'(b)$ so that the $i$-th entry is
$$\sigma'_i(b) = \begin{cases} \sigma_i(b) - \sigma_i(a), & \text{if } i \notin \{j, k\}; \\ -, & \text{if } i \in \{j, k\}. \end{cases}$$
Set $\Sigma' = \{(x, \sigma'(x)) : x \in X\}$. We say that $\Sigma'$ has been obtained from $\Sigma$ by cutting $\{a, b\}$.

Lemma 5.1. Let $N$ be a phylogenetic network on $X$ with vertex set $V$ and $|X| \geq 2$, and fix an ordering of $V - X$. Let $\{a, b\}$ be a 2-element subset of $X$.
(i) If $\{a, b\}$ is a cherry of $N$, then, up to entries with symbol $-$, the set of ordered pairs obtained from $\Sigma_N$ by reducing $b$ is the ancestral profile of the phylogenetic network $N'$ obtained from $N$ by reducing $b$.
(ii) If $\{a, b\}$ is a reticulated cherry of $N$ in which $b$ is the reticulation leaf, then, up to entries with symbol $-$, the set of ordered pairs obtained from $\Sigma_N$ by cutting $\{a, b\}$ is the ancestral profile of the phylogenetic network $N'$ obtained from $N$ by cutting $\{a, b\}$.

Proof. We prove the lemma for (ii). The proof of the lemma for (i) is similar, but easier, and omitted. Suppose $\{a, b\}$ is a reticulated cherry of $N$ in which $b$ is the reticulation leaf, and $N'$ is obtained from $N$ by cutting $\{a, b\}$. Let $\Sigma'$ be the set of ordered pairs obtained from $\Sigma_N$ by cutting $\{a, b\}$. We will show that $\Sigma'$ is the ancestral profile of a phylogenetic network isomorphic to $N'$. Let $V$ denote the vertex set of $N$, and fix an ordering $v_1, v_2, \ldots, v_t$ of the vertices in $V - X$. Let $p_a$ and $p_b$ denote the parents of $a$ and $b$, respectively, in $N$.
Set
$$U_a = \{v_j \in V - X : \sigma_j(a) = 1 = \sigma_j(b),\ \sigma_j(x) = 0 \text{ for all } x \in X - \{a, b\}\}$$
and
$$U_b = \{v_k \in V - X : \sigma_k(b) = 1,\ \sigma_k(x) = 0 \text{ for all } x \in X - b\}.$$
Observe that $U_a$ and $U_b$ are both non-empty as $p_a \in U_a$ and $p_b \in U_b$, but $U_a \cap U_b$ is empty.

Now consider $\Sigma'$. To obtain $\Sigma'$ from $\Sigma_N$, we chose (i) an entry in $\sigma(a)$, say $j$, such that $\sigma_j(a) = 1 = \sigma_j(b)$ but $\sigma_j(x) = 0$ for all $x \in X - \{a, b\}$, and (ii) an entry in $\sigma(b)$, say $k$, such that $\sigma_k(b) = 1$ but $\sigma_k(x) = 0$ for all $x \in X - b$. In particular, these chosen entries correspond to vertices, $v_j$ and $v_k$ say, in $U_a$ and $U_b$, respectively.

Let $N_1$ denote the phylogenetic network obtained from $N$ by bijectively relabelling the vertices in $U_a$ with the vertices in $U_a$ so that $p_a$ is relabelled $v_j$, and bijectively relabelling the vertices in $U_b$ with the vertices in $U_b$ so that $p_b$ is relabelled $v_k$. Clearly, $N_1$ is isomorphic to $N$ and $\Sigma_N$ is the ancestral profile of $N_1$. Furthermore, it is easily checked that, up to isomorphism, $\Sigma'$ is the ancestral profile of the phylogenetic network $N'_1$ obtained from $N_1$ by cutting $\{a, b\}$. But $N'_1$ is isomorphic to $N'$, thereby completing the proof of the lemma.

With Lemma 5.1 in hand, we next prove the uniqueness part of Theorem 2.2.

Proof of the uniqueness part of Theorem 2.2. The proof is by induction on the sum of the number $n$ of leaves and the number $k$ of reticulations in $N$. If $n + k = 1$, then $n = 1$ and $k = 0$, and $N$ consists of the single vertex in $X$, and so uniqueness holds. If $n + k = 2$, then, as $N$ is orchard, $n = 2$ and $k = 0$, in which case, $N$ consists of two leaves attached to the root. Again, uniqueness holds.

Now suppose that $n + k \geq 3$ and the uniqueness holds for all orchard networks for which the sum of the number of leaves and the number of reticulations is at most $n + k - 1$. Note that, as $N$ is orchard, $n \geq 2$. Since $N$ is orchard, it has either a cherry or a reticulated cherry.
Thus, by Corollary 3.2 and Lemma 3.3, it is possible to find a 2-element subset $\{a, b\}$ of $X$ using only $\Sigma_N$ such that $\{a, b\}$ is either a cherry or a reticulated cherry of $N$. If the latter, we can also determine from $\Sigma_N$ which of $a$ and $b$ is the reticulation leaf. Without loss of generality, we may assume $b$ is the reticulation leaf. Depending on whether $\{a, b\}$ is a cherry or a reticulated cherry, let $N'$ be obtained from $N$ by reducing $b$ or cutting $\{a, b\}$, respectively, and let $\Sigma'$ be the set of ordered pairs obtained from $\Sigma_N$ by reducing $b$ or cutting $\{a, b\}$, respectively. Regardless of the way $N'$ and $\Sigma'$ are obtained, it follows by Corollary 4.2 and Lemma 5.1 that $N'$ is an orchard network and, up to isomorphism, $\Sigma'$ is the ancestral profile of $N'$. Furthermore, $N'$ has either $n - 1$ leaves and $k$ reticulations if $\{a, b\}$ is a cherry, or $n$ leaves and $k - 1$ reticulations if $\{a, b\}$ is a reticulated cherry. Therefore, by the induction assumption, up to isomorphism, $N'$ is the unique phylogenetic network whose ancestral profile is $\Sigma'$.

Now let $N_1$ be a phylogenetic network on $X$ such that $\Sigma_N$ is the ancestral profile of $N_1$. Note that $N_1$ has the same number of non-leaf vertices as $N$, but not necessarily the same number of reticulations. First assume $\{a, b\}$ is a cherry of $N$. Then, by Corollary 3.2, $\{a, b\}$ is a cherry of $N_1$. Let $N'_1$ denote the phylogenetic network obtained from $N_1$ by reducing $b$. By Lemma 5.1(i), up to isomorphism, $\Sigma'$ is the ancestral profile of $N'_1$. Thus, by the induction assumption, $N'_1$ is isomorphic to $N'$. Since $\{a, b\}$ is a cherry of $N$ and $N_1$, it follows that $N_1$ is isomorphic to $N$.

Lastly, assume $\{a, b\}$ is a reticulated cherry of $N$. Then, by Lemma 3.3, $\{a, b\}$ is a reticulated cherry of $N_1$ in which $b$ is the reticulation leaf. Let $N'_1$ be the phylogenetic network obtained from $N_1$ by cutting $\{a, b\}$. By Lemma 5.1(ii), up to isomorphism, $\Sigma'$ is the ancestral profile of $N'_1$. Hence, by the induction assumption, $N'_1$ is isomorphic to $N'$.
As $\{a, b\}$ is a reticulated cherry of $N$ and $N_1$ in which $b$ is the reticulation leaf, we have that $N_1$ is isomorphic to $N$. This completes the proof of the uniqueness part of Theorem 2.2.

5.1. The algorithm. Let $N$ be an orchard network on $X$, and let $\Sigma$ denote the ancestral profile of $N$. We next describe an algorithm, called Orchard Tuple, which takes as its input $X$ and $\Sigma$, and returns a phylogenetic network $N_1$ on $X$ that is isomorphic to $N$. The proof that the algorithm works correctly is essentially the same as that used to prove the uniqueness part of Theorem 2.2, and so it is omitted. The running time of the algorithm follows its description.

1. If $|X| = 1$, then return the phylogenetic network consisting of the single vertex in $X$.
2. Else, find a 2-element subset, $\{a, b\}$ say, of $X$ such that either (I) the equality of Corollary 3.2 characterising a cherry holds for $a$ and $b$, or (II) the conditions of Lemma 3.3 characterising a reticulated cherry with reticulation leaf $b$ hold for $a$ and $b$.
   (a) If $\{a, b\}$ satisfies (I) (in which case $\{a, b\}$ is a cherry), then
       (i) Reduce $b$ in $\Sigma$ to give the set $\Sigma'$ of $|X - b|$ ordered pairs.
       (ii) Apply Orchard Tuple to input $X' = X - b$ and $\Sigma'$. Construct $N_1$ from the returned phylogenetic network $N'_1$ on $X'$ by subdividing the arc incident to $a$ with a new vertex $p_a$, and adjoining a new leaf $b$ via the new arc $(p_a, b)$. If $|X'| = 1$, then set $N_1$ to be the phylogenetic network consisting of the leaves $a$ and $b$ adjoined to the root. Return $N_1$.
   (b) Else, $\{a, b\}$ satisfies (II) (in which case $\{a, b\}$ is a reticulated cherry and $b$ is the reticulation leaf).
       (i) Cut $\{a, b\}$ in $\Sigma$ to give the set $\Sigma'$ of $|X|$ ordered pairs.
       (ii) Apply Orchard Tuple to $X$ and $\Sigma'$. Construct $N_1$ from the returned phylogenetic network $N'_1$ on $X$ by subdividing the arcs incident to $a$ and $b$ with new vertices $p_a$ and $p_b$, respectively, and adding the new arc $(p_a, p_b)$. Return $N_1$.

We now consider the running time of Orchard Tuple.
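Before the analysis, the two augmentation steps 2(a)(ii) and 2(b)(ii), which undo a cherry reduction, can be sketched as follows. This is an illustrative sketch only: it assumes a network is stored as a set of arcs `(u, v)` with the empty set standing for a single-vertex network, and that the caller supplies fresh names for the new vertices. The function names are hypothetical.

```python
def parent_of_leaf(arcs, leaf):
    """Return the unique parent of `leaf` in a set of arcs (u, v)."""
    (parent,) = [u for u, v in arcs if v == leaf]
    return parent


def attach_cherry(arcs, a, b, p_a):
    """Step 2(a)(ii): subdivide the arc into a with p_a and add the leaf b.

    If the network is the single vertex a (no arcs), p_a becomes the root
    with the two leaves a and b adjoined to it.
    """
    arcs = set(arcs)
    if not arcs:
        return {(p_a, a), (p_a, b)}
    q = parent_of_leaf(arcs, a)
    arcs.remove((q, a))
    arcs |= {(q, p_a), (p_a, a), (p_a, b)}
    return arcs


def attach_reticulated_cherry(arcs, a, b, p_a, p_b):
    """Step 2(b)(ii): subdivide the arcs into a and b, then add (p_a, p_b)."""
    arcs = set(arcs)
    q = parent_of_leaf(arcs, a)
    g = parent_of_leaf(arcs, b)
    arcs -= {(q, a), (g, b)}
    arcs |= {(q, p_a), (p_a, a), (g, p_b), (p_b, b), (p_a, p_b)}
    return arcs


# Hypothetical example: rebuild a three-leaf network with one reticulation
# by undoing three cherry reductions, starting from the single vertex a.
step1 = attach_cherry(set(), "a", "b", "rho")
step2 = attach_cherry(step1, "b", "c", "t2")
step3 = attach_reticulated_cherry(step2, "a", "b", "t1", "r")
```

Each call copies the arc set, so earlier intermediate networks remain available. The sketch makes no attempt to find the pair $\{a, b\}$ from the profile, which is the role of Step 2 of Orchard Tuple.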
The input to the algorithm is a set $X$ and the ancestral profile $\Sigma$ of an orchard network $N$ on $X$ whose entries are either a non-negative integer or the symbol $-$. Let $V$ denote the vertex set of $N$. As noted earlier, the ancestral sets of the leaves in $X$, recorded as a set of ordered pairs indexed by $x \in X$, can be determined from $\Sigma$ in $O(|V|)$ time. This is a preprocessing step and it will have no effect on the theoretical running time. Except for when $|X| \in \{1, 2\}$, in which case Orchard Tuple runs in constant time, each iteration begins by finding a 2-element subset of $X$ satisfying either (I) or (II). This takes $O(|X|^2 |V|)$ time as there are $O(|X|^2)$ two-element subsets of $X$ and each subset takes $O(|V|)$ time to decide if it satisfies either (I) or (II). Once such a 2-element subset is found, we construct $\Sigma'$. Regardless of the way $\Sigma'$ is constructed, this takes $O(|X||V|)$ time. When $N'_1$ is returned, we augment to $N_1$ in constant time, and so each iteration takes $O(|X|^3 |V|^2)$ time. When we recurse, $\Sigma'$ is the ancestral profile of an orchard network with either one fewer leaf or one fewer reticulation than an orchard network for which $\Sigma$ is the ancestral profile. Thus the total number of iterations is $O(|V|)$. We conclude that Orchard Tuple completes in $O(|X|^3 |V|^3)$ time. This completes the proof of Theorem 2.2.

6. Conclusion

The main result of this paper, Theorem 2.2, shows that the ancestral profile of an orchard network $N$ on $X$ uniquely determines $N$ amongst all phylogenetic networks on $X$. This generalises results in both [4] and [5], which considered tree-sibling time-consistent networks and tree-child networks (subclasses of orchard networks whose number of reticulations is at most linear in the number of leaves). Curiously, these latter results have a different motivation compared to what motivated Theorem 2.2. There the motivation is to construct a distance measure (metric) on the classes of tree-sibling time-consistent networks and tree-child networks which is computable in polynomial time.
Recalling that they considered the equivalent notion of path-tuples, for two tree-sibling time-consistent (resp. tree-child) networks $N_1$ and $N_2$, the distance between $N_1$ and $N_2$ is the value
$$|\Pi_{N_1} \,\triangle\, \Pi_{N_2}|,$$
where the symmetric difference and the cardinality operator refer to multisets. It is easily checked that this same measure extends to the class of orchard networks.

As noted in the introduction, our result does not relate to specific biological data that is readily available at present. However, a type of data that might provide ancestral profile information would be genomic fragments that follow lineage splitting and reticulation events, so that when a reticulation occurs, a trace of each fragment in the incoming lineage is preserved in (different regions of) the reticulate genome.

Lastly, we end with a question asked by one of the referees. For a given orchard network $N$, is it possible to count the number of complete cherry-reduction sequences of $N$?

Acknowledgements

We thank the three anonymous referees for their careful reading of the paper and constructive comments.

References

[1] M. Baroni, M. Steel, Accumulation phylogenies, Annals of Combinatorics 10 (2006) 19–30.
[2] M. Bordewich, K.T. Huber, V. Moulton, C. Semple, Recovering normal networks from shortest inter-taxa distance information, Journal of Mathematical Biology 77 (2018) 571–594.
[3] M. Bordewich, C. Semple, Determining phylogenetic networks from inter-taxa distances, Journal of Mathematical Biology 73 (2016) 283–303.
[4] G. Cardona, M. Llabrés, F. Rosselló, G. Valiente, A distance metric for a class of tree-sibling phylogenetic networks, Bioinformatics 24 (2008) 1481–1488.
[5] G. Cardona, F. Rosselló, G. Valiente, Comparison of tree-child phylogenetic networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics 6 (2009) 552–569.
[6] W.F. Doolittle, Phylogenetic classification and the universal tree, Science 284 (1999) 2124–2128.
[7] J. Felsenstein, Inferring Phylogenies, Sinauer Associates, Sunderland, MA, 2004.
[8] A.R. Francis, M. Steel, Which phylogenetic networks are merely trees with additional arcs?, Systematic Biology 64 (2015) 768–777.
[9] P. Gambette, K.T. Huber, On encodings of phylogenetic networks of bounded level, Journal of Mathematical Biology 65 (2012) 157–180.
[10] K.T. Huber, L. van Iersel, R. Janssen, M. Jones, V. Moulton, Y. Murakami, C. Semple, Rooting for phylogenetic networks, in preparation.
[11] D.H. Huson, R. Rupp, C. Scornavacca, Phylogenetic Networks: Concepts, Algorithms and Applications, Cambridge University Press, 2010.
[12] L. van Iersel, V. Moulton, Trinets encode tree-child and level-2 phylogenetic networks, Journal of Mathematical Biology 68 (2014) 1707–1729.
[13] R. Janssen, Y. Murakami, Solving phylogenetic network containment problem using cherry-picking sequences, arXiv:1812.08065 (2018).
[14] W. Jetz, G.H. Thomas, J.B. Joy, K. Hartmann, A.O. Mooers, The global diversity of birds in space and time, Nature 491 (2012) 444–448.
[15] E.V. Koonin, The turbulent network dynamics of microbial evolution and the statistical tree of life, Journal of Molecular Evolution 80 (2015) 244–250.
[16] F. Pardi, C. Scornavacca, Reconstructible phylogenetic networks: Do not distinguish the indistinguishable, PLoS Computational Biology 11 (2015) e1004135.
[17] C. Semple, M. Steel, Phylogenetics, Oxford University Press, Oxford, 2003.
[18] S.J. Willson, Reconstruction of certain phylogenetic networks from the genomes at their leaves, Journal of Theoretical Biology 252 (2008) 338–349.
[19] S.J. Willson, Properties of normal phylogenetic networks, Bulletin of Mathematical Biology 72 (2010) 340–358.
[20] S.J. Willson, Regular networks can be uniquely constructed from their trees, IEEE/ACM Transactions on Computational Biology and Bioinformatics 8 (2011) 785–796.
Alfréd Rényi Institute of Mathematics, Hungarian Academy of Sciences, Budapest, Hungary
E-mail address: erdos.peter@renyi.mta.hu

School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
E-mail address: charles.semple@canterbury.ac.nz

School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
E-mail address: mike.steel@canterbury.ac.nz

Mathematics, arXiv (Cornell University)

A class of phylogenetic networks reconstructable from ancestral profiles

Mathematics , Volume 2021 (1901) – Jan 13, 2019

ISSN
0025-5564
eISSN
ARCH-3343
DOI
10.1016/j.mbs.2019.04.009
Abstract

A CLASS OF PHYLOGENETIC NETWORKS RECONSTRUCTABLE FROM ANCESTRAL PROFILES

PETER L. ERDŐS, CHARLES SEMPLE, AND MIKE STEEL

Abstract. Rooted phylogenetic networks provide an explicit representation of the evolutionary history of a set $X$ of sampled species. In contrast to phylogenetic trees which show only speciation events, networks can also accommodate reticulate processes (for example, hybrid evolution, endosymbiosis, and lateral gene transfer). A major goal in systematic biology is to infer evolutionary relationships, and while phylogenetic trees can be uniquely determined from various simple combinatorial data on $X$, for networks the reconstruction question is much more subtle. Here we ask when a network can be uniquely reconstructed from its 'ancestral profile' (the number of paths from each ancestral vertex to each element in $X$). We show that reconstruction holds (even within the class of all networks) for a class of networks we call 'orchard networks', and we provide a polynomial-time algorithm for reconstructing any orchard network from its ancestral profile. Our approach relies on establishing a structural theorem for orchard networks, which also provides for a fast (polynomial-time) algorithm to test if any given network is of orchard type. Since the class of orchard networks includes tree-sibling time-consistent networks and tree-child networks, our results generalise reconstruction results from 2008 and 2009. Orchard networks allow for an unbounded number $k$ of reticulation vertices, in contrast to tree-sibling time-consistent networks and tree-child networks for which $k$ is at most $2|X| - 4$ and $|X| - 1$, respectively.

Date: May 2, 2019.

1. Introduction

Phylogenetic trees and networks have become a ubiquitous tool for representing evolutionary relationships in systematic biology [7] and other areas of classification (for example, language evolution and epidemiology). From the early sketches by Charles Darwin and Ernst Haeckel in the 19th century,
more complex and detailed trees are now revealing the finer details of portions of the 'tree of life'. Today, biologists routinely build phylogenetic trees on hundreds of species, such as the recent tree of (nearly) all 10,000 species of birds [14]. Phylogenetic trees have a leaf set $X$ that consists of the sampled organisms (typically, a group of present-day species); the root of the tree represents the most recent common ancestor of the species in $X$. Current methods for inferring phylogenetic trees generally use genomic data from the species in $X$, and apply one of several possible reconstruction methods. While many of these methods are statistically based, they are ultimately founded on underlying combinatorial uniqueness results concerning trees [7, 17].

Although phylogenetic trees have proved a convenient representation for many groups of species including, for example, mammals and birds, in other domains of life evolution is not always described as a simple vertical process of speciation (where lineages split in two as new species form) and extinction. Instead, various reticulate processes allow for a 'horizontal' component. Two main examples include the formation of hybrid species (such as in certain plant or fish species), and the exchange of genes between species in a process called lateral gene transfer (such as in bacteria). An additional reticulate process relevant to early life on earth is endosymbiosis in which organelles are incorporated into cells.

1991 Mathematics Subject Classification. 05C85, 92D15.
Key words and phrases. Tree-child networks, orchard networks, accumulation phylogenies, ancestral profiles, path-tuples.
The first author was supported in part by the National Research, Development and Innovation Office (NKFIH grants K 116769 and KH 126853). The second and third authors were supported by the New Zealand Marsden Fund (UOC1709).
arXiv:1901.04064v2 [math.CO] 1 May 2019
For these reasons, phylogenetic networks (acyclic directed graphs with a single root vertex and leaves forming the set $X$) have been proposed as a more flexible and accurate representation of evolutionary history [6, 15]. Accordingly, there has been considerable recent interest in extending the mathematical foundation of phylogenetic tree reconstruction to networks [11].

This extension faces a number of mathematical obstacles. In particular, while trees can be encoded and reconstructed in several ways (for example, based on their associated system of clusters, path distances between pairs of leaves, and induced 3-leaf subtrees), none of these approaches extends to networks, except in very special cases [9, 12, 19]. This has led to various approaches being proposed, which usually involve one or more of the following:
(i) not distinguishing between phylogenetic networks that are similar in a certain way [16];
(ii) considering reconstruction only within a limited subclass of phylogenetic networks [2]; and
(iii) allowing types of information for $X$ beyond what is normally used for tree reconstruction [1].

Approach (ii) has received the most attention so far, with some positive results (for example, for reconstructing the subclass of normal networks from their induced trees [20]). In this paper, we focus more on approach (iii), and, although we restrict to a subclass of networks (which we call 'orchard networks'), our reconstruction result has the additional strength that it can distinguish between any two networks from information on $X$ provided at least one of them is an orchard network. To provide some intuition, informally, a phylogenetic network is an orchard network if it can be reduced to a single vertex by recursively finding a pair of leaves that form either a cherry or a reticulated cherry, and then applying a cherry reduction to that pair of leaves.

The type of information on $X$ we consider is the following.
View the interior (non-leaf) vertices of a phylogenetic network $N$ as being labelled. In the biological setting, this label could correspond, for example, to the genome of the ancestral species at this vertex (or some sub-genome that is sufficiently detailed to distinguish this ancestral vertex from others). For each species $x$ in the leaf set $X$, suppose we can count the number of directed paths in the network from each ancestral genome (i.e. interior vertex) to $x$. This 'ancestral profile' is thus an ordered tuple of numbers, one tuple for each leaf in $X$ (note that current technology does not yet provide this information, so our approach is in the spirit of earlier mathematical results in phylogenetics that preceded the data required for their application). It turns out that such information is not enough to distinguish between an arbitrary pair of networks (we provide an example). However, if the underlying network $N$ is an orchard network, our main result shows that no other network (orchard or not) can have the same ancestral profile. Moreover, we present and justify a polynomial-time algorithm for reconstructing any orchard network from its ancestral profile. Our arguments rely on a structural property of orchard networks which also implies that there is a polynomial-time algorithm for testing whether or not an arbitrary network is an orchard network.

Our results generalise earlier work in [4, 5] which considered the more restricted classes of 'tree-sibling time-consistent' networks and 'tree-child' networks, respectively. These authors use equivalent information on $X$ for reconstruction; however, their reconstruction result faces two limitations that are lifted here. First, the uniqueness results of [4, 5] hold only within the class of tree-sibling time-consistent networks and tree-child networks, whereas we show that ancestral profiles can distinguish an orchard network from any other network.
Second, neither tree-sibling time-consistent networks nor tree-child networks can have too many reticulate vertices (at most $2n - 4$ and $n - 1$, respectively, where $n = |X|$), whereas orchard networks can have arbitrarily many reticulate vertices (independent of $n$).

Our results are also related to (and partly motivated by) earlier work by [1] and [18] on 'accumulation phylogenies'. This involved a different subclass of networks (called 'regular' in these papers, and 'cluster networks' in [11]), which neither contains, nor is contained in, the subclass of orchard networks. A limitation of this subclass is that (unlike orchard networks) they do not allow 'redundant arcs' (an arc $(u, v)$ for which there is another path in the network from $u$ to $v$). Allowing redundant arcs has a strong biological motivation since, even if each reticulation event happens instantaneously between two contemporaneous species, redundant arcs can still appear in the resulting network if not all species at the present are sampled. The results in [1, 18] also assume any two networks being considered are within this same subclass. In summary, our results are not directly related to this earlier work on accumulation phylogenies, apart from using a related type of information.

The paper is organised as follows. The next section contains some necessary definitions along with the statement of the main result (Theorem 2.2) and deduces, as a consequence, the main result (Theorem 1) in [5]. This section also provides examples to justify various claims. Section 3 describes some preliminary lemmas, which apply more generally than for ancestral profiles, and in Section 4 we state and prove the structural property of orchard networks that allows for an easy test as to whether or not an arbitrary network is of orchard type. The proof of Theorem 2.2 is established in Section 5. We end the paper with a brief discussion in Section 6.
Lastly, just as we completed the write-up of this paper, a manuscript [13] was posted on arXiv that also considers the class of orchard networks (referred to as "cherry-picking networks" in [13]). The focus of that manuscript is quite different to that of this paper; nevertheless, it contains an independent and different proof of the structural property of orchard networks which is needed as a lemma for Theorem 2.2 in this paper.

2. Main Result

Throughout the paper $X$ denotes a non-empty finite set and, unless otherwise stated, all paths are directed. For vertices $u$ and $v$ of a directed graph $D$, we say $v$ is reachable from $u$ if there is a path in $D$ from $u$ to $v$. Furthermore, for sets $A$ and $B$, we denote the set obtained from $A$ by removing every element in $A$ that is also in $B$ by $A - B$. If $|B| = 1$, say $B = \{b\}$, we denote this by $A - b$.

Phylogenetic networks. A phylogenetic network on $X$ is a rooted acyclic directed graph with no arcs in parallel and satisfying the following properties:
(i) the (unique) root has in-degree zero and out-degree two;
(ii) a vertex with out-degree zero has in-degree one, and the set of vertices with out-degree zero is $X$; and
(iii) all other vertices either have in-degree one and out-degree two, or in-degree two and out-degree one.

Figure 1. A phylogenetic network $N$ on $\{x_1, x_2, \ldots, x_6\}$. Here, $\{x_1, x_2\}$ is a cherry and $\{x_3, x_4\}$ is a reticulated cherry with $x_4$ the reticulation leaf.

For technical reasons, if $|X| = 1$, we additionally allow a single vertex to be a phylogenetic network, in which case, the root is the vertex in $X$. Phylogenetic networks as defined here are also referred to as 'binary phylogenetic networks' in the literature.

Let $N$ be a phylogenetic network on $X$. The vertices with out-degree zero are the leaves of $N$, and so $X$ is called the leaf set of $N$.
Furthermore, vertices with in-degree one and out-degree two are tree vertices, while vertices of in-degree two and out-degree one are reticulations. The arcs directed into a reticulation are called reticulation arcs; all other arcs are tree arcs. To illustrate, an example of a phylogenetic network with leaf set $\{x_1, x_2, \ldots, x_6\}$ and three reticulations is shown in Fig. 1.

Lastly, let $N_1$ and $N_2$ be two phylogenetic networks on $X$ with vertex and arc sets $V_1$ and $E_1$, and $V_2$ and $E_2$, respectively. We say $N_1$ is isomorphic to $N_2$ if there exists a bijection $\varphi : V_1 \to V_2$ such that $\varphi(x) = x$ for all $x \in X$, and $(u, v) \in E_1$ if and only if $(\varphi(u), \varphi(v)) \in E_2$ for all $u, v \in V_1$.

Ancestral tuples and ancestral profile. Let $N$ be a phylogenetic network on $X$ with vertex set $V$. Let $v_1, v_2, \ldots, v_t$ be a fixed (arbitrary) labelling of the vertices in $V - X$. For all $x \in X$, the ancestral tuple of $x$, denoted $\sigma(x)$, is the $t$-tuple whose $i$-th entry is the number of paths in $N$ from $v_i$ to $x$. Denoted by $\Sigma_N$, we call the set
$$\Sigma_N = \{(x, \sigma(x)) : x \in X\}$$
of ordered pairs the ancestral profile of $N$. Furthermore, if $N'$ is a phylogenetic network on $X$ and, up to an ordering of the non-leaf vertices of $N'$, we have $\Sigma_{N'} = \Sigma_N$, we say $N'$ realises $\Sigma_N$. Lastly, although $\Sigma_N$ depends on the ordering of the vertices in $V - X$, the ordering is fixed and so the labelling can be effectively ignored.

Figure 2. (i) $N_1$ and (ii) $N_2$. $N_1$ has been obtained from $N$ in Fig. 1 by reducing $x_2$, while $N_2$ has been obtained from $N$ by cutting $\{x_3, x_4\}$.

Cherries and reticulated cherries. Let $N$ be a phylogenetic network on $X$, and let $\{a, b\}$ be a 2-element subset of $X$. Let $p_a$ and $p_b$ denote the parents of $a$ and $b$, respectively. We say $\{a, b\}$ is a cherry of $N$ if $p_a = p_b$.
Furthermore, if one of the parents, say $p_b$, is a reticulation and $(p_a, p_b)$ is an arc in $N$, then $\{a, b\}$ is a reticulated cherry of $N$, in which case, $b$ is the reticulation leaf of the reticulated cherry. Observe that $p_a$ is necessarily a tree vertex. For the phylogenetic network shown in Fig. 1, $\{x_1, x_2\}$ is a cherry, while $\{x_3, x_4\}$ is a reticulated cherry in which $x_4$ is the reticulation leaf. Furthermore, in Fig. 1, $\{x_4, x_5\}$ is neither a cherry nor a reticulated cherry.

We next describe two operations associated with cherries and reticulated cherries that are central to this paper. Let $N$ be a phylogenetic network. First suppose that $\{a, b\}$ is a cherry of $N$. Then reducing $b$ is the operation of deleting $b$ and suppressing the resulting vertex of in-degree one and out-degree one. If the parent of $a$ and of $b$ is the root of $N$, then reducing $b$ is the operation of deleting $b$ as well as deleting the root of $N$, thus leaving only the isolated vertex $a$. Now suppose that $\{a, b\}$ is a reticulated cherry of $N$ in which $b$ is the reticulation leaf. Then cutting $\{a, b\}$ is the operation of deleting the reticulation arc joining the parents of $a$ and $b$, and suppressing the two resulting vertices of in-degree one and out-degree one. It is easily seen that the operations of reducing a cherry and cutting a reticulated cherry both result in a phylogenetic network. Collectively, we refer to these two operations as cherry reductions. To illustrate, the phylogenetic network shown in Fig. 2(i) (resp. Fig. 2(ii)) has been obtained from the phylogenetic network in Fig. 1 by reducing $x_2$ (resp. cutting $\{x_3, x_4\}$).

Figure 3. (i) An orchard network and (ii) a non-orchard network.

Orchard networks.
For a phylogenetic network $N$, the sequence
$$N = N_0, N_1, N_2, \ldots, N_k \qquad (1)$$
of phylogenetic networks is a cherry-reduction sequence of $N$ if, for all $i \in \{1, 2, \ldots, k\}$, the phylogenetic network $N_i$ is obtained from $N_{i-1}$ by a (single) cherry reduction. The sequence is maximal if $N_k$ has no cherries or reticulated cherries. If $N_k$ consists of a single vertex, the sequence is complete, in which case, $N$ is called an orchard network. Observe that if (1) is complete, then the leaf set of $N_{k-1}$ has size two and the parent of each leaf is the root of $N_{k-1}$. It is easily checked that the phylogenetic network shown in Fig. 1 is an orchard network. In Section 4, we show that every maximal sequence of cherry reductions of an orchard network $N$ is complete. Thus if we want to construct a complete cherry-reduction sequence for an orchard network, the order in which the reductions are applied does not matter. In turn, this provides an easy test to decide whether or not an arbitrary network is orchard.

One of the most well-studied classes of phylogenetic networks is the class of tree-child networks. Introduced in [5], a phylogenetic network is tree-child if every non-leaf vertex is the parent of a tree vertex or a leaf. Tree-child networks are examples of orchard networks [3], but there exist orchard networks that are not tree-child. Indeed, while the size of the leaf set bounds the total number of vertices of a tree-child network [5], the total number of vertices in an orchard network is not necessarily bounded by the size of its leaf set. For example, the phylogenetic network shown in Fig. 3(i) is an orchard network with exactly three leaves but, by extending it in the obvious way, we can produce an orchard network with an arbitrarily large odd number of vertices and still with exactly three leaves. Furthermore, not all phylogenetic networks are orchard networks, as Fig. 3(ii) illustrates.
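The tree-child condition above is a purely local check, which the following minimal sketch illustrates. It is not taken from the paper: it assumes a valid binary phylogenetic network stored as a set of arcs `(u, v)`, and the function name and both example networks are hypothetical.

```python
def is_tree_child(arcs):
    """Check that every non-leaf vertex has a child that is a tree vertex or a leaf."""
    parents, children = {}, {}
    for u, v in arcs:
        for w in (u, v):
            parents.setdefault(w, set())
            children.setdefault(w, set())
        parents[v].add(u)
        children[u].add(v)
    # In a binary phylogenetic network, a child of in-degree one is a tree
    # vertex or a leaf, while a child of in-degree two is a reticulation.
    return all(
        any(len(parents[c]) == 1 for c in kids)
        for kids in children.values()
        if kids  # leaves have no children and are skipped
    )


# Hypothetical example 1: three leaves, one reticulation r; tree-child.
TREE_CHILD = {("rho", "t1"), ("rho", "t2"), ("t1", "a"), ("t1", "r"),
              ("t2", "r"), ("t2", "c"), ("r", "b")}

# Hypothetical example 2: built from a three-leaf tree by adding the
# reticulated cherry {a, b} and then the reticulated cherry {c, b}, so it is
# orchard by construction. The reticulation r1 has the reticulation r2 as
# its only child, so the network is not tree-child.
STACKED = {("rho", "t1"), ("rho", "s"), ("t1", "a"), ("t1", "r1"),
           ("s", "r1"), ("s", "t3"), ("t3", "c"), ("t3", "r2"),
           ("r1", "r2"), ("r2", "b")}
```

The second example shows the kind of stacked reticulations that orchard networks permit and tree-child networks do not.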
ERDOS, CHARLES SEMPLE, AND MIKE STEEL For this paper, a second relevant class of phylogenetic networks is the class of tree-sibling time-consistent networks. Let N be a phylogenetic net- work. We say N is tree-sibling if every reticulation has a parent that is also the parent of a tree vertex or a leaf. Furthermore, N is time-consistent if there is a map t from the vertex set of N to the non-negative integers such that if (u; v) is a reticulation arc of N , then t(u) = t(v); otherwise, t(u) < t(v). We refer to such a mapping as a temporal labelling. In the lit- erature, time-consistent networks are also referred to as temporal networks. Like tree-child networks, the class of tree-sibling time-consistent networks is a proper subclass of orchard networks. For completeness, we include a proof of containment. To see that it is proper, it is shown in [4] that, unlike or- chard networks, the number of reticulations of a tree-sibling time-consistent network is bounded by the size of its leaf set. Lemma 2.1. Let N be a tree-sibling time-consistent network. Then N is an orchard network. Proof. Clearly, the lemma holds if N has no reticulations. Therefore we may assume that N has at least one reticulation. We rst show that N has either a cherry or a reticulated cherry. Let t be a temporal labelling of the vertices of N , and let v be a reticulation with the property that t(v)  t(v ) for all reticulations v of N . Since N is tree-sibling, v has a parent, u say, that is the parent of a vertex w which is either a tree vertex or a leaf. By maximality, no reticulations are reachable from v or w. Therefore, if two leaves are reachable from either v or w, then N has a cherry. If this does not occur, then w is a leaf and that the (unique) child, x say, of v is also a leaf. In particular, fw; xg is a reticulated cherry of N . To complete the proof, let N be obtained from N by a cherry reduction. Clearly, N is also tree-sibling. 
Furthermore, it is easily checked that the mapping $t'$ from the vertex set of $N'$ to the non-negative integers given by $t'(u) = t(u)$ is a temporal labelling of $N'$. Thus $N'$ is tree-sibling time-consistent. The lemma now follows.

Main result. The following theorem is the main result of the paper.

Theorem 2.2. Let $N$ be an orchard network on $X$ with vertex set $V$. Then, up to isomorphism, $N$ is the unique phylogenetic network on $X$ realising $\Sigma_N$. Furthermore, up to isomorphism, $N$ can be reconstructed from $\Sigma_N$ in time $O(|X|^3 |V|^3)$.

It is worth emphasising that the uniqueness of $N$ in the statement of Theorem 2.2 is amongst all phylogenetic networks on $X$, not just within the class of orchard networks on $X$. Furthermore, if $N$ is not an orchard network, then the outcome of Theorem 2.2 does not necessarily hold. In particular, consider the two phylogenetic networks $N_1$ and $N_2$ in Fig. 4. It is easily checked that by fixing an ordering of the non-leaf vertices of each of $N_1$ and $N_2$ so that the parent of $y$ is in the same position in both orderings, we have $\Sigma_{N_1} = \Sigma_{N_2}$. But $N_1$ is not isomorphic to $N_2$.

Figure 4. Two non-isomorphic phylogenetic networks $N_1$ and $N_2$, but $\Sigma_{N_1} = \Sigma_{N_2}$.

Theorem 2.2 generalises results of Cardona et al. [4] and Cardona et al. [5]. Let $N$ be a phylogenetic network on $X$ with vertex set $V$ and let $x_1, x_2, \ldots, x_n$ be a fixed ordering of the leaves in $X$. For all $v \in V - X$, the path tuple of $v$, denoted $\pi(v)$, is the $n$-tuple whose $i$-th entry is the number of paths in $N$ from $v$ to $x_i$. Let $\Pi_N$ denote the multiset

$$\{\pi(v) : v \in V - X\}$$

of path tuples of $N$. If $N'$ is a phylogenetic network on $X$ and, up to an ordering of $X$, we have $\Pi_{N'} = \Pi_N$, we say $N'$ realises $\Pi_N$. The next theorem was established in [4] and [5].

Theorem 2.3. Let $N$ be a phylogenetic network on $X$.
(i) If $N$ is tree-sibling time-consistent, then, up to isomorphism, $N$ is the unique tree-sibling time-consistent network on $X$ realising $\Pi_N$.
(ii) If $N$ is tree-child, then, up to isomorphism, $N$ is the unique tree-child network on $X$ realising $\Pi_N$.
Furthermore, for both instances, up to isomorphism, $N$ can be constructed from $\Pi_N$ in time polynomial in the size of $X$.

Let $N$ be a phylogenetic network on $X$ with vertex set $V$. The set $\Sigma_N$ and multiset $\Pi_N$ are equivalent in the amount of information they provide. To see this, let $x_1, x_2, \ldots, x_n$ and $v_1, v_2, \ldots, v_t$ be fixed orderings of the vertices in $X$ and $V - X$, respectively. Then, for all $i \in \{1, 2, \ldots, t\}$, the $n$-tuple $\pi(v_i)$ is the tuple whose $j$-th entry is the $i$-th entry of $\sigma(x_j)$ for all $j \in \{1, 2, \ldots, n\}$. Similarly, each ordered pair in $\Sigma_N$ can be obtained from $\Pi_N$. Thus Theorem 2.2 generalises Theorem 2.3 in two ways. First, it shows that the latter holds for the more general class of orchard networks and, second, the uniqueness is not confined to the class of networks being constructed.

Figure 5. Two orchard networks $N_1$ and $N_2$ with $\Gamma_{N_1} = \Gamma_{N_2}$, but $\Sigma_{N_1} \ne \Sigma_{N_2}$.

We end the section with three remarks. Firstly, Theorem 2.2 is not the first reconstruction result concerning the class of orchard networks. Although this class was not named, it is shown in [3] that orchard networks are reconstructible from their so-called multiset distance matrices; see [3, Theorem 3.4]. We have no doubt that, over time, the class of orchard networks will be shown to be reconstructible in other ways as well.

The second remark concerns a related, but weaker, notion to that of ancestral tuples called ancestral sets. Let $N$ be a phylogenetic network on $X$ with vertex set $V$. For all $x \in X$, the ancestral set of $x$ is

$$\gamma(x) = \{v \in V - X : x \text{ is reachable from } v\}.$$

Thus $\gamma(x)$ is the set of non-leaf vertices $v$ in $N$ for which there is a directed path from $v$ to $x$. Observe that, for all $x \in X$, the root of $N$ is always an element of $\gamma(x)$ and so $\gamma(x)$ is non-empty.
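The three encodings discussed so far (ancestral tuples, path tuples, and ancestral sets) can all be read off a network by counting directed paths. The following is a minimal sketch rather than anything taken from the paper: the function names, the arc-list representation, and the small example network (root `r`, reticulation `h`, leaves `a`, `b`, `c`) are assumptions made purely for illustration.

```python
from collections import defaultdict
from functools import lru_cache

def ancestral_profile(arcs, leaves):
    """Count directed paths from every non-leaf vertex to every leaf.

    Returns (order, sigma): `order` lists the non-leaf vertices and
    sigma[x][i] is the number of paths from order[i] to leaf x.
    """
    children = defaultdict(list)
    vertices = set()
    for u, v in arcs:
        children[u].append(v)
        vertices.update((u, v))
    order = sorted(vertices - set(leaves))

    @lru_cache(maxsize=None)
    def n_paths(v, x):
        # A vertex reaches itself by the trivial path; otherwise sum over children.
        return 1 if v == x else sum(n_paths(c, x) for c in children[v])

    sigma = {x: tuple(n_paths(v, x) for v in order) for x in leaves}
    return order, sigma

def ancestral_sets(order, sigma):
    """gamma(x): the non-leaf vertices with at least one path to x."""
    return {x: {v for v, c in zip(order, s) if c > 0} for x, s in sigma.items()}

def path_tuples(order, sigma, leaves):
    """pi(v): the j-th entry is the entry of sigma(x_j) belonging to v."""
    return {v: tuple(sigma[x][i] for x in leaves) for i, v in enumerate(order)}

# Hypothetical example: {a, b} is a reticulated cherry with reticulation leaf b.
ARCS = [("r", "u"), ("r", "w"), ("u", "a"), ("u", "h"),
        ("w", "h"), ("w", "c"), ("h", "b")]
LEAVES = ["a", "b", "c"]

order, sigma = ancestral_profile(ARCS, LEAVES)
gamma = ancestral_sets(order, sigma)
pi = path_tuples(order, sigma, LEAVES)
```

In this example the root has two paths to `b` (one through each parent of `h`), which is information the ancestral set of `b` alone does not record.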
Let $\Gamma_N$ denote the set $\{(x, \gamma(x)) : x \in X\}$ of ordered pairs. Given $\Sigma_N$, it is clear that we can construct $\Gamma_N$ in time $O(|V|)$.

To see that ancestral sets is a weaker notion than ancestral tuples, consider the two orchard networks $N_1$ and $N_2$ shown in Fig. 5, where the non-leaf vertices have been labelled $1, 2, \ldots, 8$. For each $i \in \{1, 2\}$, the ancestral sets of $x_1$, $x_2$, and $x_3$ are $\{1, 2, 3, 4, 5, 7\}$, $\{1, 2, \ldots, 8\}$, and $\{1, 2, 3\}$, respectively. But $N_1$ is not isomorphic to $N_2$. Note that, for a fixed ordering of $1, 2, \ldots, 8$, the ancestral tuple of $x_2$ differs in $N_1$ and $N_2$ even though the ancestral tuples of $x_1$ and $x_3$ are the same for $N_1$ and $N_2$. Nevertheless, despite this example, the ancestral sets of a phylogenetic network $N$ do provide some information regarding the structure of $N$. As this is of possible independent interest, we highlight this in the next section, where the preliminary lemmas are established in terms of ancestral sets.

The third remark concerns the relationship between orchard networks and the increasingly prominent class of tree-based networks [8]. A phylogenetic network $N$ on $X$ with root $\rho$ and vertex set $V$ is tree-based if it has, as a subgraph, a rooted subtree with root $\rho$, vertex set $V$, and leaf set $X$. Note that $\rho$ in the subtree may have out-degree one. It is shown in [10] that the class of orchard networks is a proper subclass of tree-based networks. To see that it is proper, observe that the non-orchard networks $N_1$ and $N_2$ in Fig. 4 are both tree-based. Thus, the networks in this figure also show that Theorem 2.2 does not extend to tree-based networks.

3. Preliminary Lemmas

In this section, we establish several results that will be used in the proof of Theorem 2.2. These results show that the ancestral sets, and thus the ancestral tuples, of an arbitrary phylogenetic network recognise and distinguish cherries and reticulated cherries.

Lemma 3.1.
Let $N$ be a phylogenetic network on $X$, and let $a$ and $b$ be distinct elements in $X$. Then $\gamma(a) \subseteq \gamma(b)$ if and only if the parent of $b$ is reachable from the parent of $a$.

Proof. Let $p_a$ and $p_b$ denote the parents of $a$ and $b$, respectively. If $p_b$ is reachable from $p_a$, then it is clear that $\gamma(a) \subseteq \gamma(b)$. To prove the converse, suppose that $\gamma(a) \subseteq \gamma(b)$. Then $p_a \in \gamma(b)$ and so, by definition, $b$ is reachable from $p_a$. In turn, this implies that $p_b$ is reachable from $p_a$.

The next corollary immediately follows from Lemma 3.1 and the fact that phylogenetic networks are acyclic.

Corollary 3.2. Let $N$ be a phylogenetic network on $X$, and let $\{a, b\}$ be a 2-element subset of $X$. Then $\{a, b\}$ is a cherry in $N$ if and only if $\gamma(a) = \gamma(b)$.

Lemma 3.3. Let $N$ be a phylogenetic network on $X$, and let $\{a, b\}$ be a 2-element subset of $X$. Then $\{a, b\}$ is a reticulated cherry of $N$ in which $b$ is the reticulation leaf if and only if
(i) $\gamma(a) \subsetneq \gamma(b)$,
(ii) there is no $x \in X - \{a, b\}$ such that $\gamma(a) \subseteq \gamma(x)$, and
(iii) $\left|\gamma(b) - \bigcup_{x \in X - b} \gamma(x)\right| = 1$.

Proof. Let $p_a$ and $p_b$ denote the parents of $a$ and $b$, respectively. It is easily checked that if $\{a, b\}$ is a reticulated cherry in which $b$ is the reticulation leaf, then (i)–(iii) hold. So suppose that (i)–(iii) hold. Since (i) holds, it follows by Lemma 3.1 that there is a directed path $P$ in $N$ from $p_a$ to $p_b$, and $p_a \ne p_b$ by Corollary 3.2. If $p_b$ is a tree vertex, then $N$ has a leaf, $c$ say, reachable from $p_b$ such that $c \ne b$. Moreover, $c \ne a$, as otherwise $p_a$ would be reachable from $p_b$, contradicting that $N$ is acyclic. This implies that $\gamma(a) \subseteq \gamma(c)$, contradicting (ii). Therefore $p_b$ is a reticulation. Lastly, assume $(p_a, p_b)$ is not an arc in $N$. Let $u$ denote the vertex on $P$ immediately prior to $p_b$. If $u$ is a tree vertex, then, as $u \ne p_a$, the network $N$ has a leaf $c' \notin \{a, b\}$ reachable from $u$ with $\gamma(a) \subseteq \gamma(c')$, contradicting (ii). On the other hand, if $u$ is a reticulation, then

$$\left|\gamma(b) - \bigcup_{x \in X - b} \gamma(x)\right| \ge 2,$$

contradicting (iii). Thus $(p_a, p_b)$ is an arc and so $\{a, b\}$ is a reticulated cherry in which $b$ is the reticulation leaf.

4. Order Does Not Matter

Let $N$ be an orchard network.
Then, by definition, there exists a complete cherry-reduction sequence for $N$. But how do we find such a sequence, and does the order in which we apply the cherry reductions matter? The next proposition says that if we take $N$ and repeatedly apply cherry reductions until no more are possible, we always construct a complete cherry-reduction sequence. A vertex on a directed path is non-terminal if it is neither the first nor last vertex on the path.

Proposition 4.1. Let $N$ be an orchard network, and let

$$N = N_0, N_1, N_2, \ldots, N_\ell \tag{2}$$

be a maximal sequence of cherry reductions. Then this sequence is complete.

Proof. Let $X$ denote the leaf set of $N$, and suppose (2) is not complete. Paralleling (2), we begin by constructing a sequence

$$N = M_0, M_1, M_2, \ldots, M_\ell$$

of rooted acyclic directed graphs as follows. If $N_1$ is obtained from $N_0$ by reducing a leaf of a cherry, then $M_1$ is obtained from $M_0$ by deleting the same leaf but not suppressing the resulting vertex of in-degree one and out-degree one. Similarly, if $N_1$ is obtained from $N_0$ by cutting a reticulated cherry, then $M_1$ is obtained from $M_0$ by deleting the same reticulation arc but not suppressing the two resulting vertices of in-degree one and out-degree one. More generally, if $N_i$ is obtained from $N_{i-1}$ by reducing a leaf of a cherry, that is, deleting a leaf $b$ say and suppressing its parent $p_b$, then $M_i$ is obtained from $M_{i-1}$ by deleting $b$ as well as deleting every non-terminal vertex on the (unique) path from $p_b$ to $b$ in $M_{i-1}$. Note that each of these non-terminal vertices has in-degree one and out-degree one in $M_{i-1}$. On the other hand, if $N_i$ is obtained from $N_{i-1}$ by cutting a reticulated cherry, that is, deleting a reticulation arc $(p_a, p_b)$ and suppressing $p_a$ and $p_b$, then $M_i$ is obtained from $M_{i-1}$ by deleting $(p_a, p_b)$. Observe that, for all $i$, if we suppress every vertex in $M_i$ of in-degree one and out-degree one, we obtain $N_i$; that is, $M_i$ is a subdivision of $N_i$ for all $i$. Furthermore, as (2) is not complete, the root $\rho$ of $N$ is never deleted and so, for all $i$, the root of $M_i$ is also $\rho$ and has out-degree two in $M_i$.

We now analyse $M_\ell$. Since (2) is maximal and not complete, $N_\ell$ has at least one reticulation. This implies that $M_\ell$ has at least one vertex of in-degree two and out-degree one. We next show that every non-terminal vertex in $M_\ell$ on a path from $\rho$ to a vertex of in-degree two and out-degree one has degree three.

4.1.1. Let $v$ be a vertex of in-degree two and out-degree one in $M_\ell$. If $u$ is a non-terminal vertex of $M_\ell$ on a path in $M_\ell$ from $\rho$ to $v$, then $u$ has degree three in $M_\ell$.

Proof. Suppose $u$ is a vertex of in-degree one and out-degree one on a path from $\rho$ to $v$ in $M_\ell$. In $N$, the vertex $u$ has degree three. Therefore, for some $i \in \{1, 2, \ldots, \ell\}$, we have that $N_i$ is obtained from $N_{i-1}$ by a cherry reduction in which an arc incident with $u$ is deleted. Now, as $v$ is a vertex of in-degree two and out-degree one in $M_\ell$, it follows that $v$ is a reticulation in $N_\ell$, and therefore a reticulation in $N_i$. Thus there is a path $P$ in $N_i$ from $u$ to $v$. It is now easily checked that no cherry reduction applied to $N_{i-1}$ in which an arc incident with $u$ and not lying on $P$ is deleted is possible. Hence $u$ has degree three.

We now complete the proof of the proposition. Since $N$ is orchard, there is a sequence

$$N = N'_0, N'_1, N'_2, \ldots, N'_k$$

of cherry reductions such that $N'_k$ consists of a single vertex. Let $i$ be the smallest index such that $N'_i$ is obtained from $N'_{i-1}$ by cutting a reticulated cherry in which the deleted reticulation arc, $(u, v)$ say, has the property that $v$ is in $M_\ell$ and it has in-degree two and out-degree one in $M_\ell$. Observe that, by the choice of $i$, no vertex of in-degree two and out-degree one is reachable from $v$ in $M_\ell$ except $v$ itself. As (2) is maximal, this implies that there is a unique vertex, $\ell_v$ say, in $X$ that is reachable from $v$ in $M_\ell$.

Now, $u$ is a tree vertex in $N'_{i-1}$ whose other child, in addition to $v$, is a leaf. By (4.1.1), $u$ has degree three in $M_\ell$. Furthermore, as $u$ is a tree vertex in $N'_{i-1}$, it follows that $u$ has in-degree one and out-degree two in $M_\ell$. Let $w$ denote the child of $u$ in $M_\ell$ that is not $v$. At least one vertex in $X$ is reachable from $w$ in $M_\ell$ and this vertex is not $\ell_v$. If, in $M_\ell$, there is no vertex reachable from $w$ with in-degree two and out-degree one, then (2) is not maximal. Therefore, in $M_\ell$ there is such a vertex $w'$ reachable from $w$. In $N$, the vertex $w'$ is a reticulation, and so there is a $j \in \{1, 2, \ldots, k\}$ such that $N'_j$ is obtained from $N'_{j-1}$ by cutting a reticulated cherry in which a reticulation arc directed into $w'$ is deleted. Since $(u, v)$ is the reticulation arc directed into $v$ that is deleted, it follows that $j < i$. But, by the choice of $i$, we have $i < j$; a contradiction. We conclude that (2) is complete.

The following corollary is an immediate consequence of Proposition 4.1.

Corollary 4.2. Let $N$ be an orchard network, and let $\{a, b\}$ be a cherry or a reticulated cherry of $N$. If $N'$ is obtained from $N$ by reducing $b$ if $\{a, b\}$ is a cherry or cutting $\{a, b\}$ if $\{a, b\}$ is a reticulated cherry, then $N'$ is an orchard network.

Since deciding if a given pair of leaves of a phylogenetic network is either a cherry or a reticulated cherry takes constant time and a cherry reduction also takes constant time, the last corollary gives a polynomial-time algorithm for deciding if an arbitrary phylogenetic network $N$ is orchard. In particular, repeatedly find a cherry or a reticulated cherry, and apply the appropriate cherry reduction until this process is no longer possible. This takes at most $O(|V|)$ iterations, where $V$ is the vertex set of $N$.
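The reduce-until-stuck procedure just described can be sketched in a few lines. This is an illustrative sketch only: it assumes the input is a valid binary phylogenetic network given as a list of arcs, the function names and the two example networks are hypothetical, and it rescans the arc set in every round, so it makes no attempt to match the constant-time bookkeeping mentioned above.

```python
from itertools import permutations

def suppress(arcs, v):
    """Remove a vertex of in-degree at most one and out-degree one."""
    ins = [u for (u, w) in arcs if w == v]
    outs = [w for (u, w) in arcs if u == v]
    arcs.difference_update({(u, v) for u in ins} | {(v, w) for w in outs})
    if ins:  # v is not the root, so reconnect its parent to its child
        arcs.add((ins[0], outs[0]))

def find_reduction(arcs):
    """Return ('reduce', p, b) for a cherry, ('cut', p_a, p_b) for a
    reticulated cherry, or None if neither exists."""
    parents = {}
    for u, v in arcs:
        parents.setdefault(v, []).append(u)
    tails = {u for u, _ in arcs}
    leaves = [v for v in parents if v not in tails]
    for a, b in permutations(leaves, 2):
        pa, pb = parents[a][0], parents[b][0]
        if pa == pb:
            return "reduce", pa, b
        if len(parents.get(pb, [])) == 2 and pa in parents[pb]:
            return "cut", pa, pb
    return None

def is_orchard(arc_list):
    """Apply cherry reductions until none is possible; orchard iff one vertex remains."""
    arcs = set(arc_list)
    while arcs:
        move = find_reduction(arcs)
        if move is None:
            return False          # stuck with more than one vertex left
        kind, pa, other = move
        arcs.discard((pa, other))  # delete leaf arc (reduce) or reticulation arc (cut)
        suppress(arcs, pa)
        if kind == "cut":
            suppress(arcs, other)
    return True

# Hypothetical examples: one reticulated cherry, and a two-leaf network with none.
ORCHARD = [("r", "u"), ("r", "w"), ("u", "a"), ("u", "h"),
           ("w", "h"), ("w", "c"), ("h", "b")]
NOT_ORCHARD = [("r", "u"), ("r", "v"), ("u", "h1"), ("u", "h2"),
               ("v", "h1"), ("v", "h2"), ("h1", "x1"), ("h2", "x2")]
```

By Proposition 4.1 the answer does not depend on which available reduction `find_reduction` happens to return first, which is why the sketch simply takes the first pair it finds.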
If, at the completion of this process, we have a phylogenetic network consisting of a single vertex, then $N$ is orchard; otherwise, $N$ is not orchard. Observe that if $N$ is orchard with $n$ leaves and $k$ reticulations, then this process consists of $n + k - 1$ cherry reductions.

5. Proof of Theorem 2.2

In this section, we prove Theorem 2.2. For a phylogenetic network $N$, Corollary 3.2 and Lemma 3.3 show that it is straightforward to recognise cherries and reticulated cherries of $N$ using only the ancestral sets, and thus the ancestral tuples, of $N$. This fact is freely used throughout this section. We next describe two operations on tuples that parallel the operations of reducing a cherry and cutting a reticulated cherry.

Let $X$ be a non-empty finite set and, for some fixed $t$, let

$$\Sigma = \{(x, \sigma(x)) : x \in X\}$$

be a set of ordered pairs, where, for all $x \in X$, we have that $\sigma(x)$ is a $t$-tuple whose entries are either non-negative integers or $\ast$. Note that the symbol $\ast$ is going to be used as a placeholder. Let $\{a, b\}$ be a 2-element subset of $X$.

The first operation will be used only in association with reducing $b$ when $\{a, b\}$ is a cherry. Let $j \in \{1, 2, \ldots, t\}$ such that $\sigma_j(a) = \sigma_j(b) = 1$, but $\sigma_j(x) = 0$ for all $x \in X - \{a, b\}$. Let $\Sigma'$ be the set of $|X - b|$ ordered pairs obtained from $\Sigma$ as follows. For all $x \in X - b$, set $\sigma'(x)$ so that the $i$-th entry is

$$\sigma'_i(x) = \begin{cases} \sigma_i(x), & \text{if } i \ne j; \\ \ast, & \text{if } i = j. \end{cases}$$

Set $\Sigma' = \{(x, \sigma'(x)) : x \in X - b\}$. We say that $\Sigma'$ has been obtained from $\Sigma$ by reducing $b$.

The second operation will be used only in association with cutting $\{a, b\}$ when $\{a, b\}$ is a reticulated cherry in which $b$ is the reticulation leaf. Let $j \in \{1, 2, \ldots, t\}$ such that $\sigma_j(a) = 1 = \sigma_j(b)$ but $\sigma_j(x) = 0$ for all $x \in X - \{a, b\}$, and let $k \in \{1, 2, \ldots, t\}$ such that $\sigma_k(b) = 1$ but $\sigma_k(x) = 0$ for all $x \in X - b$. Let $\Sigma'$ be the set of $|X|$ ordered pairs obtained from $\Sigma$ as follows. For all $x \in X - b$, set $\sigma'(x)$ so that the $i$-th entry is

$$\sigma'_i(x) = \begin{cases} \sigma_i(x), & \text{if } i \notin \{j, k\}; \\ \ast, & \text{if } i \in \{j, k\}; \end{cases}$$

and set $\sigma'(b)$ so that the $i$-th entry is

$$\sigma'_i(b) = \begin{cases} \sigma_i(b) - \sigma_i(a), & \text{if } i \notin \{j, k\}; \\ \ast, & \text{if } i \in \{j, k\}. \end{cases}$$

Set $\Sigma' = \{(x, \sigma'(x)) : x \in X\}$. We say that $\Sigma'$ has been obtained from $\Sigma$ by cutting $\{a, b\}$.

Lemma 5.1. Let $N$ be a phylogenetic network on $X$ with vertex set $V$ and $|X| \ge 2$, and fix an ordering of $V - X$. Let $\{a, b\}$ be a 2-element subset of $X$.
(i) If $\{a, b\}$ is a cherry of $N$, then, up to entries with symbol $\ast$, the set of ordered pairs obtained from $\Sigma_N$ by reducing $b$ is the ancestral profile of the phylogenetic network $N'$ obtained from $N$ by reducing $b$.
(ii) If $\{a, b\}$ is a reticulated cherry of $N$ in which $b$ is the reticulation leaf, then, up to entries with symbol $\ast$, the set of ordered pairs obtained from $\Sigma_N$ by cutting $\{a, b\}$ is the ancestral profile of the phylogenetic network $N'$ obtained from $N$ by cutting $\{a, b\}$.

Proof. We prove the lemma for (ii). The proof of the lemma for (i) is similar, but easier, and omitted. Suppose $\{a, b\}$ is a reticulated cherry of $N$ in which $b$ is the reticulation leaf, and $N'$ is obtained from $N$ by cutting $\{a, b\}$. Let $\Sigma'$ be the set of ordered pairs obtained from $\Sigma_N$ by cutting $\{a, b\}$. We will show that $\Sigma'$ is the ancestral profile of a phylogenetic network isomorphic to $N'$.

Let $V$ denote the vertex set of $N$, and fix an ordering $v_1, v_2, \ldots, v_t$ of the vertices in $V - X$. Let $p_a$ and $p_b$ denote the parents of $a$ and $b$, respectively, in $N$. Set

$$U_a = \{v_j \in V - X : \sigma_j(a) = 1 = \sigma_j(b),\ \sigma_j(x) = 0 \text{ for all } x \in X - \{a, b\}\}$$

and

$$U_b = \{v_k \in V - X : \sigma_k(b) = 1,\ \sigma_k(x) = 0 \text{ for all } x \in X - b\}.$$

Observe that $U_a$ and $U_b$ are both non-empty as $p_a \in U_a$ and $p_b \in U_b$, but $U_a \cap U_b$ is empty.

Now consider $\Sigma'$. To obtain $\Sigma'$ from $\Sigma_N$, we chose (i) an entry in $\sigma(a)$, say $j$, such that $\sigma_j(a) = 1 = \sigma_j(b)$ but $\sigma_j(x) = 0$ for all $x \in X - \{a, b\}$, and (ii) an entry in $\sigma(b)$, say $k$, such that $\sigma_k(b) = 1$ but $\sigma_k(x) = 0$ for all $x \in X - b$.
In particular, these chosen entries correspond to vertices, $v_j$ and $v_k$ say, in $U_a$ and $U_b$, respectively.

Let $N_1$ denote the phylogenetic network obtained from $N$ by bijectively relabelling the vertices in $U_a$ with the vertices in $U_a$ so that $p_a$ is relabelled $v_j$, and bijectively relabelling the vertices in $U_b$ with the vertices in $U_b$ so that $p_b$ is relabelled $v_k$. Clearly, $N_1$ is isomorphic to $N$ and $\Sigma_N$ is the ancestral profile of $N_1$. Furthermore, it is easily checked that, up to isomorphism, $\Sigma'$ is the ancestral profile of the phylogenetic network $N'_1$ obtained from $N_1$ by cutting $\{a, b\}$. But $N'_1$ is isomorphic to $N'$, thereby completing the proof of the lemma.

With Lemma 5.1 in hand, we next prove the uniqueness part of Theorem 2.2.

Proof of the uniqueness part of Theorem 2.2. The proof is by induction on the sum of the number $n$ of leaves and the number $k$ of reticulations in $N$. If $n + k = 1$, then $n = 1$ and $k = 0$, and $N$ consists of the single vertex in $X$, and so uniqueness holds. If $n + k = 2$, then, as $N$ is orchard, $n = 2$ and $k = 0$, in which case $N$ consists of two leaves attached to the root. Again, uniqueness holds. Now suppose that $n + k \ge 3$ and that uniqueness holds for all orchard networks for which the sum of the number of leaves and the number of reticulations is at most $n + k - 1$. Note that, as $N$ is orchard, $n \ge 2$.

Since $N$ is orchard, it has either a cherry or a reticulated cherry. Thus, by Corollary 3.2 and Lemma 3.3, it is possible to find a 2-element subset $\{a, b\}$ of $X$ using only $\Sigma_N$ such that $\{a, b\}$ is either a cherry or a reticulated cherry of $N$. If the latter, we can also determine from $\Sigma_N$ which of $a$ and $b$ is the reticulation leaf. Without loss of generality, we may assume $b$ is the reticulation leaf. Depending on whether $\{a, b\}$ is a cherry or a reticulated cherry, let $N'$ be obtained from $N$ by reducing $b$ or cutting $\{a, b\}$, respectively, and let $\Sigma'$ be the set of ordered pairs obtained from $\Sigma_N$ by reducing $b$ or cutting $\{a, b\}$, respectively.
Regardless of the way $N'$ and $\Sigma'$ are obtained, it follows by Corollary 4.2 and Lemma 5.1 that $N'$ is an orchard network and, up to isomorphism, $\Sigma'$ is the ancestral profile of $N'$. Furthermore, $N'$ has either $n - 1$ leaves and $k$ reticulations if $\{a, b\}$ is a cherry, or $n$ leaves and $k - 1$ reticulations if $\{a, b\}$ is a reticulated cherry. Therefore, by the induction assumption, up to isomorphism, $N'$ is the unique phylogenetic network whose ancestral profile is $\Sigma'$.

Now let $N_1$ be a phylogenetic network on $X$ such that $\Sigma_N$ is the ancestral profile of $N_1$. Note that $N_1$ has the same number of non-leaf vertices as $N$, but not necessarily the same number of reticulations. First assume $\{a, b\}$ is a cherry of $N$. Then, by Corollary 3.2, $\{a, b\}$ is a cherry of $N_1$. Let $N'_1$ denote the phylogenetic network obtained from $N_1$ by reducing $b$. By Lemma 5.1(i), up to isomorphism, $\Sigma'$ is the ancestral profile of $N'_1$. Thus, by the induction assumption, $N'_1$ is isomorphic to $N'$. Since $\{a, b\}$ is a cherry of $N$ and $N_1$, it follows that $N_1$ is isomorphic to $N$.

Lastly, assume $\{a, b\}$ is a reticulated cherry of $N$. Then, by Lemma 3.3, $\{a, b\}$ is a reticulated cherry of $N_1$ in which $b$ is the reticulation leaf. Let $N'_1$ be the phylogenetic network obtained from $N_1$ by cutting $\{a, b\}$. By Lemma 5.1(ii), up to isomorphism, $\Sigma'$ is the ancestral profile of $N'_1$. Hence, by the induction assumption, $N'_1$ is isomorphic to $N'$. As $\{a, b\}$ is a reticulated cherry of $N$ and $N_1$ in which $b$ is the reticulation leaf, we have that $N_1$ is isomorphic to $N$. This completes the proof of the uniqueness part of Theorem 2.2.

5.1. The algorithm. Let $N$ be an orchard network on $X$, and let $\Sigma$ denote the ancestral profile of $N$. We next describe an algorithm, called Orchard Tuple, which takes as its input $X$ and $\Sigma$, and returns a phylogenetic network $N_1$ on $X$ that is isomorphic to $N$. The proof that the algorithm works correctly is essentially the same as that used to prove the uniqueness part of Theorem 2.2, and so it is omitted.
The running time of the algorithm follows its description.

1. If $|X| = 1$, then return the phylogenetic network consisting of the single vertex in $X$.
2. Else, find a 2-element subset, $\{a, b\}$ say, of $X$ such that either
   (I) $\gamma(a) = \gamma(b)$ or
   (II) $\gamma(a) \subsetneq \gamma(b)$, there is no $x \in X - \{a, b\}$ with $\gamma(a) \subseteq \gamma(x)$, and $\left|\gamma(b) - \bigcup_{x \in X - b} \gamma(x)\right| = 1$.
   (a) If $\{a, b\}$ satisfies (I) (in which case $\{a, b\}$ is a cherry), then
      (i) Reduce $b$ in $\Sigma$ to give the set $\Sigma'$ of $|X - b|$ ordered pairs.
      (ii) Apply Orchard Tuple to input $X' = X - b$ and $\Sigma'$. Construct $N_1$ from the returned phylogenetic network $N'_1$ on $X'$ by subdividing the arc incident to $a$ with a new vertex $p_a$, and adjoining a new leaf $b$ via the new arc $(p_a, b)$. If $|X'| = 1$, then set $N_1$ to be the phylogenetic network consisting of the leaves $a$ and $b$ adjoined to the root. Return $N_1$.
   (b) Else, $\{a, b\}$ satisfies (II) (in which case $\{a, b\}$ is a reticulated cherry and $b$ is the reticulation leaf).
      (i) Cut $\{a, b\}$ in $\Sigma$ to give the set $\Sigma'$ of $|X|$ ordered pairs.
      (ii) Apply Orchard Tuple to $X$ and $\Sigma'$. Construct $N_1$ from the returned phylogenetic network $N'_1$ on $X$ by subdividing the arcs incident to $a$ and $b$ with new vertices $p_a$ and $p_b$, respectively, and adding the new arc $(p_a, p_b)$. Return $N_1$.

We now consider the running time of Orchard Tuple. The input to the algorithm is a set $X$ and the ancestral profile $\Sigma$ of an orchard network $N$ on $X$ whose entries are either a non-negative integer or the symbol $\ast$. Let $V$ denote the vertex set of $N$. As noted earlier, the set $\Gamma = \{(x, \gamma(x)) : x \in X\}$ can be determined from $\Sigma$ in $O(|V|)$ time. This is a preprocessing step and it will have no effect on the theoretical running time. Except for when $|X| \in \{1, 2\}$, in which case Orchard Tuple runs in constant time, each iteration begins by finding a 2-element subset of $X$ satisfying either (I) or (II). This takes $O(|X|^2 |V|)$ time as there are $O(|X|^2)$ two-element subsets of $X$ and each subset takes $O(|V|)$ time to decide if it satisfies either (I) or (II).
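The steps of Orchard Tuple can be sketched as follows. This is a hedged illustration, not the authors' implementation: a profile is assumed to be a dictionary mapping each leaf to its tuple, `None` stands in for the placeholder symbol, the helper names (`find_pair`, `pick`, `blank`, `sorted_path_tuples`) and the `_p0, _p1, ...` vertex names are hypothetical, conditions (I) and (II) are tested exactly as listed above with (II) excluding $a$ itself, and no attempt is made to match the stated running time.

```python
from itertools import count, permutations

STAR = None        # stands in for the placeholder symbol in a tuple
_fresh = count()   # supplies names for the non-leaf vertices that are rebuilt

def gamma(sig):
    """Ancestral set read off a tuple: indices with a positive path count."""
    return {i for i, c in enumerate(sig) if c}

def find_pair(prof):
    """Find (a, b) satisfying condition (I) or (II), using ancestral sets only."""
    g = {x: gamma(s) for x, s in prof.items()}
    for a, b in permutations(prof, 2):
        if g[a] == g[b]:
            return "cherry", a, b
        rest = set().union(*(g[x] for x in prof if x != b))
        blocked = any(g[a] <= g[x] for x in prof if x not in (a, b))
        if g[a] < g[b] and not blocked and len(g[b] - rest) == 1:
            return "reticulated", a, b
    raise ValueError("no cherry or reticulated cherry found")

def pick(prof, ones):
    """Index whose entry is 1 for the leaves in `ones` and 0 for all other leaves."""
    t = len(next(iter(prof.values())))
    return next(i for i in range(t)
                if all(prof[x][i] == (1 if x in ones else 0) for x in prof))

def blank(sig, idx):
    """Overwrite the entries at the indices in `idx` with the placeholder."""
    return tuple(STAR if i in idx else c for i, c in enumerate(sig))

def orchard_tuple(prof):
    """Return the arcs of a network whose ancestral profile is `prof`."""
    if len(prof) == 1:
        return []                                  # step 1: a single vertex
    kind, a, b = find_pair(prof)                   # step 2
    j = pick(prof, {a, b})
    if kind == "cherry":                           # step 2(a): reduce b, recurse
        sub = {x: blank(s, {j}) for x, s in prof.items() if x != b}
        arcs = orchard_tuple(sub)
        p = f"_p{next(_fresh)}"
        arcs = [(u, p) if v == a else (u, v) for u, v in arcs]
        return arcs + [(p, a), (p, b)]
    k = pick(prof, {b})                            # step 2(b): cut {a, b}, recurse
    sub = {x: blank(s, {j, k}) for x, s in prof.items()}
    sub[b] = tuple(STAR if c is STAR or i in (j, k) else c - prof[a][i]
                   for i, c in enumerate(prof[b]))
    arcs = orchard_tuple(sub)
    pa, pb = f"_p{next(_fresh)}", f"_p{next(_fresh)}"
    arcs = [(u, pa) if v == a else (u, pb) if v == b else (u, v) for u, v in arcs]
    return arcs + [(pa, a), (pb, b), (pa, pb)]

def sorted_path_tuples(arcs, leaves):
    """Sorted path tuples of the non-leaf vertices (used only to check a result)."""
    kids = {}
    for u, v in arcs:
        kids.setdefault(u, []).append(v)
    def n_paths(v, x):
        return 1 if v == x else sum(n_paths(c, x) for c in kids.get(v, []))
    return sorted(tuple(n_paths(v, x) for x in leaves) for v in kids)

# Hypothetical profile of a three-leaf network with one reticulation above b.
PROFILE = {"a": (0, 1, 1, 0), "b": (1, 2, 1, 1), "c": (0, 1, 0, 1)}
NETWORK = orchard_tuple(PROFILE)
```

Because the reconstruction is only defined up to isomorphism, a convenient check is to recompute the path tuples of `NETWORK` with `sorted_path_tuples` and compare them, as a multiset, with the columns of `PROFILE`.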
Once such a 2-element subset is found, we construct $\Sigma'$. Regardless of the way $\Sigma'$ is constructed, this takes $O(|X||V|)$ time. When $N'_1$ is returned, we augment it to $N_1$ in constant time, and so each iteration takes $O(|X|^3 |V|^2)$ time. When we recurse, $\Sigma'$ is the ancestral profile of an orchard network with either one less leaf or one less reticulation than an orchard network for which $\Sigma$ is the ancestral profile. Thus the total number of iterations is $O(|V|)$. We conclude that Orchard Tuple completes in $O(|X|^3 |V|^3)$ time. This completes the proof of Theorem 2.2.

6. Conclusion

The main result of this paper, Theorem 2.2, shows that the ancestral profile of an orchard network $N$ on $X$ uniquely determines $N$ amongst all phylogenetic networks on $X$. This generalises results in both [4] and [5], which considered tree-sibling time-consistent networks and tree-child networks (subclasses of orchard networks whose number of reticulations is at most linear in the number of leaves). Curiously, these latter results have a different motivation compared to what motivated Theorem 2.2. There the motivation is to construct a distance measure (metric) on the classes of tree-sibling time-consistent networks and tree-child networks which is computable in polynomial time. Recalling that they considered the equivalent notion of path tuples, for two tree-sibling time-consistent (resp. tree-child) networks $N_1$ and $N_2$, the distance between $N_1$ and $N_2$ is the value

$$|\Pi_{N_1} \,\triangle\, \Pi_{N_2}|,$$

where the symmetric difference and the cardinality operator refer to multisets. It is easily checked that this same measure extends to the class of orchard networks.

As noted in the introduction, our result does not relate to specific biological data that is readily available at present.
However, a type of data that might provide ancestral profile information would be genomic fragments that follow lineage splitting and reticulation events, so that when a reticulation occurs, a trace of each fragment in the incoming lineage is preserved in (different regions of) the reticulate genome.

Lastly, we end with a question asked by one of the referees. For a given orchard network $N$, is it possible to count the number of complete cherry-reduction sequences of $N$?

Acknowledgements

We thank the three anonymous referees for their careful reading of the paper and constructive comments.

References

[1] M. Baroni, M. Steel, Accumulation phylogenies, Annals of Combinatorics 10 (2006) 19–30.
[2] M. Bordewich, K.T. Huber, V. Moulton, C. Semple, Recovering normal networks from shortest inter-taxa distance information, Journal of Mathematical Biology 77 (2018) 571–594.
[3] M. Bordewich, C. Semple, Determining phylogenetic networks from inter-taxa distances, Journal of Mathematical Biology 73 (2016) 283–303.
[4] G. Cardona, M. Llabrés, F. Rosselló, G. Valiente, A distance metric for a class of tree-sibling phylogenetic networks, Bioinformatics 24 (2008) 1481–1488.
[5] G. Cardona, F. Rosselló, G. Valiente, Comparison of tree-child phylogenetic networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics 6 (2009) 552–569.
[6] W.F. Doolittle, Phylogenetic classification and the universal tree, Science 284 (1999) 2124–2128.
[7] J. Felsenstein, Inferring Phylogenies, Sinauer Associates, Sunderland, MA, 2004.
[8] A.R. Francis, M. Steel, Which phylogenetic networks are merely trees with additional arcs?, Systematic Biology 64 (2015) 768–777.
[9] P. Gambette, K.T. Huber, On encodings of phylogenetic networks of bounded level, Journal of Mathematical Biology 65 (2012) 157–180.
[10] K.T. Huber, L. van Iersel, R. Janssen, M. Jones, V. Moulton, Y. Murakami, C. Semple, Rooting for phylogenetic networks, in preparation.
[11] D.H. Huson, R. Rupp, C. Scornavacca, Phylogenetic Networks: Concepts, Algorithms and Applications, Cambridge University Press, 2010.
[12] L. van Iersel, V. Moulton, Trinets encode tree-child and level-2 phylogenetic networks, Journal of Mathematical Biology 68 (2014) 1707–1729.
[13] R. Janssen, Y. Murakami, Solving phylogenetic network containment problem using cherry-picking sequences, arXiv:1812.08065 (2018).
[14] W. Jetz, G.H. Thomas, J.B. Joy, K. Hartmann, A.O. Mooers, The global diversity of birds in space and time, Nature 491 (2012) 444–448.
[15] E.V. Koonin, The turbulent network dynamics of microbial evolution and the statistical tree of life, Journal of Molecular Evolution 80 (2015) 244–250.
[16] F. Pardi, C. Scornavacca, Reconstructible phylogenetic networks: Do not distinguish the indistinguishable, PLoS Computational Biology 11 (2015) e1004135.
[17] C. Semple, M. Steel, Phylogenetics, Oxford University Press, Oxford, 2003.
[18] S.J. Willson, Reconstruction of certain phylogenetic networks from the genomes at their leaves, Journal of Theoretical Biology 252 (2008) 338–349.
[19] S.J. Willson, Properties of normal phylogenetic networks, Bulletin of Mathematical Biology 72 (2010) 340–358.
[20] S.J. Willson, Regular networks can be uniquely constructed from their trees, IEEE/ACM Transactions on Computational Biology and Bioinformatics 8 (2011) 785–796.

Alfréd Rényi Institute of Mathematics, Hungarian Academy of Sciences, Budapest, Hungary
E-mail address: erdos.peter@renyi.mta.hu

School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
E-mail address: charles.semple@canterbury.ac.nz

School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
E-mail address: mike.steel@canterbury.ac.nz

Journal

Mathematics, arXiv (Cornell University)

Published: Jan 13, 2019
