Exact and approximate limit behaviour of the Yule tree's cophenetic index

In this work we study the limit distribution of an appropriately normalized cophenetic index of the pure-birth tree conditioned on $n$ contemporary tips. We show that this normalized phylogenetic balance index is a submartingale that converges almost surely and in $L^2$. We link our work with studies on trees without branch lengths and show that in this case the limit distribution is a contraction-type distribution, similar to the Quicksort limit distribution. In the continuous branch case we suggest approximations to the limit distribution. We propose heuristic methods of simulating from these distributions and it may be observed that these algorithms result in reasonable tails. Therefore, we propose a way based on the quantiles of the derived distributions for hypothesis testing, whether an observed phylogenetic tree is consistent with the pure-birth process. Simulating a sample by the proposed heuristics is rapid, while exact simulation (simulating the tree and then calculating the index) is a time-consuming procedure. We conduct a power study to investigate how well the cophenetic indices detect deviations from the Yule tree and apply the methodology to empirical phylogenies.

Keywords: Contraction type distribution; Cophenetic index; Martingales; Phylogenetics; Significance testing

arXiv:1703.08954v3 [q-bio.PE] 30 Apr 2018

1 Introduction

Phylogenetic trees are now a standard when analyzing groups of species. They are inferred from molecular sequences by algorithms that often assume a Markov chain for mutations at the individual positions of the genetic sequence (e.g. Ewens and Grant, 2005; Felsenstein, 2004; Yang, 2006). Given a phylogenetic tree it is often of interest to quantify the rate(s) of speciation and extinction for the studied species. To do this one commonly assumes a birth-death process with constant rates.
However, the development of formal statistical tests of whether a given tree comes from a given branching process model is an open area of research (see the still relevant "Work remaining" part at the end of Ch. 33 in Felsenstein, 2004). The reason for the apparent lack of widespread use of such tests (but see Blum and François, 2005) could be the lack of a commonly agreed on test statistic. This is because a tree is a complex object and there are multiple ways in which to summarize it in a single number.

One proposed way of summarizing a tree is through indices that quantify how balanced it is, i.e. how close it is to a fully symmetric tree. Two such indices have been with us for many years now: Sackin's (Sackin, 1972) and Colless' (Colless, 1982). Alternatively, McKenzie and Steel (2000) proposed to measure balance by counting cherries on the tree and they showed that after appropriate centring and scaling, this index converges to the standard normal distribution (for examples of other indices see Ch. 33 in Felsenstein, 2004). Recently, a new balance index was proposed: the cophenetic index (Mir et al., 2013).

The work here is inspired by private communication with evolutionary biologist Gabriel Yedid (current affiliation Nanjing Agricultural University, Nanjing, China) concerning the usage of the cophenetic index for significance testing of whether a given tree is consistent with the pure-birth process. He noticed that simulated distributions of the index have much heavier tails than those of the normal and t distributions and hence, comparing centred and scaled cophenetic indices with the usual Gaussian or t quantiles is not appropriate for significance testing. It would lead to a higher false positive rate, i.e. rejecting the null hypothesis of no extinction when a tree was generated by a pure-birth process. Our aim here is to propose an approach for working analytically with the cophenetic index, especially to improve hypothesis tests for phylogenetic trees, i.e.
how to recognize if the tree is out of the "Yule zone" (Yang et al., 2017). We show that there is a relationship between the cophenetic index and the Quicksort algorithm. This suggests that the methods exploring (e.g. Fill and Janson, 2000, 2001; Janson, 2015) the limiting distribution of the Quicksort algorithm can be an inspiration for studying analytical properties of the cophenetic index.

The paper is organized as follows. In Section 2 we formally define the cophenetic index (for trees with and without branch lengths) and present the most important results of the manuscript. We define an associated submartingale that converges almost surely and in $L^2$ (Thm. 2.4), propose an elegant representation (Thm. 2.7) and a very promising approximation (Def. 2.8). Afterwards in Section 3, we show that in the discrete setting the limit law of the normalized cophenetic index is a contraction-type distribution. Based on this we propose alternative approximations to the limit law of the normalized (with branch lengths) cophenetic index. In Section 4 we describe heuristic algorithms to simulate from these limit laws, show simulated quantiles, explore the power of the cophenetic index to recognize deviations from the Yule tree (comparing with Sackin's and Colless' indices' powers), and apply the indices to example empirical data. In Section 5 we prove the claims presented in Section 2 alongside other supporting results. Then, in Section 6 we study the second order properties of this decomposition and conjecture a Central Limit Theorem (CLT, Rem. 6.10). We end the paper with Section 7 by describing alternative representations of the cophenetic index.

2 The cophenetic index and summary of main results

Mir et al. (2013) recently proposed a new balance index for phylogenetic trees.

Definition 2.1 (Mir et al.
(2013)) For a given phylogenetic tree on $n$ tips and for each pair of tips $(i, j)$ let $\tilde{\phi}_{ij}^{(n)}$ be the number of branches from the root to the most recent common ancestor of tips $i$ and $j$. We then define the discrete cophenetic index as

$$\tilde{\Phi}^{(n)} = \sum_{1 \le i < j \le n} \tilde{\phi}_{ij}^{(n)}.$$

Mir et al. (2013) show that this index has a better resolution than the "traditional" ones. In particular the cophenetic index has a range of values of the order of $O(n^3)$ while Colless' and Sackin's ranges have an order of $O(n^2)$. Furthermore, unlike the other two previously mentioned, $\tilde{\Phi}^{(n)}$ makes mathematical sense for trees that are not fully resolved (i.e. not binary).

In this work we study phylogenetic trees with branch lengths and hence consider a variation of the cophenetic index.

Definition 2.2 For a given phylogenetic tree on $n$ tips and for each pair of tips $(i, j)$ let $\phi_{ij}^{(n)}$ be the time from the most recent common ancestor of tips $i$ and $j$ to the root/origin (depending on the tree model) of the tree. We then define the continuous cophenetic index as

$$\Phi^{(n)} = \sum_{1 \le i < j \le n} \phi_{ij}^{(n)}.$$

Remark 2.3 In the original setting, when the distance between two nodes was measured by counting branches, Mir et al. (2013) did not consider the edge leading to the root. In our work here, where our prime concern is with trees with random branch lengths, we include the branch leading to the root. This is not a big difference, one just has to remember to add to each distance between nodes the same exponential (exp(1), parametrization by the rate) random variable (see Section 5 for description of the tree's growth).

The results of the present manuscript are built around a scaled version of the cophenetic index which is an almost surely and $L^2$ convergent submartingale. We first introduce some notation. Let $\mathcal{Y}_n$ be the $\sigma$-algebra containing all the information on the Yule tree with $n$ tips and define

$$H_{n,m} := \sum_{k=1}^{n} \frac{1}{k^m}.$$

Below we present the main results concerning the cophenetic index, leaving the proofs and supporting theorems for Section 5.
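Definitions 2.1 and 2.2 can be made concrete with a short Python sketch. The nested-tuple tree encoding, the function name `cophenetic_index` and the two four-tip example trees are illustrative assumptions, not taken from the paper (whose own computations were done in R). The sketch uses the fact that an internal node with $n_L$ and $n_R$ tips below its two daughters is the most recent common ancestor of exactly $n_L n_R$ pairs of tips.

```python
from typing import Tuple, Union

# Hypothetical encoding (not from the paper): a tip is a float, the length of
# its pendant branch; an internal node is a tuple (left, right, length), where
# `length` is the branch leading to that node (the root edge for the root).
Tree = Union[float, Tuple["Tree", "Tree", float]]


def cophenetic_index(tree: Tree, discrete: bool = False, root_edge: bool = True) -> float:
    """Sum, over all pairs of tips, of the depth of their most recent common ancestor.

    discrete=True counts branches (Definition 2.1); otherwise branch lengths
    are added up (Definition 2.2). root_edge=True includes the edge leading
    to the root, as described in Remark 2.3.
    """

    def walk(node, parent_depth):
        # Returns (number of tips below node, index contribution below node).
        if not isinstance(node, tuple):
            return 1, 0.0
        left, right, length = node
        depth = parent_depth + (1.0 if discrete else length)
        n_left, phi_left = walk(left, depth)
        n_right, phi_right = walk(right, depth)
        # n_left * n_right pairs of tips have this node as their ancestor.
        return n_left + n_right, phi_left + phi_right + n_left * n_right * depth

    if not isinstance(tree, tuple):
        return 0.0  # a single tip has no pairs
    # Without a root edge the root itself sits at depth 0.
    start = 0.0 if root_edge else -(1.0 if discrete else tree[2])
    return walk(tree, start)[1]


# Two ultrametric four-tip examples (branch lengths chosen for illustration).
caterpillar4 = (((0.5, 0.5, 0.5), 1.0, 0.5), 1.5, 1.0)
balanced4 = ((0.5, 0.5, 1.0), (0.5, 0.5, 1.0), 1.0)

print(cophenetic_index(caterpillar4, discrete=True, root_edge=False))
print(cophenetic_index(balanced4, discrete=True, root_edge=False))
```

The caterpillar tree gets a larger discrete index than the balanced one, which is the sense in which the index measures imbalance. Including the root edge adds the same amount to every pair, so for four tips the discrete index grows by $\binom{4}{2} = 6$.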
Theorem 2.4 Consider a scaled cophenetic index

$$W_n = \binom{n}{2}^{-1} \Phi^{(n)}.$$

$W_n$ is a positive submartingale that converges almost surely and in $L^2$ to a random variable with finite first and second moments.

Definition 2.5 For $k = 1, \ldots, n-1$ let us define $1_k^{(n)}$ as the indicator random variable taking the value of 1 if a randomly sampled pair of species coalesced at the $k$-th (counting from the origin of the tree) speciation event.

We know (e.g. Bartoszek and Sagitov, 2015b; Stadler, 2009; Steel and McKenzie, 2001) that

$$\pi_{n,k} := P\left(1_k^{(n)} = 1\right) = E\left[1_k^{(n)}\right] = 2\,\frac{n+1}{n-1}\,\frac{1}{(k+1)(k+2)}. \quad (1)$$

Definition 2.6 For $i = 1, \ldots, n-1$ let us introduce the random variable

$$V_i^{(n)} := \frac{1}{i} \sum_{k=i}^{n-1} E\left[1_k^{(n)} \mid \mathcal{Y}_n\right]. \quad (2)$$

Theorem 2.7 $W_n$ can be represented as

$$W_n = \sum_{i=1}^{n-1} V_i^{(n)} Z_i, \quad (3)$$

where $Z_1, \ldots, Z_{n-1}$ are i.i.d. exp(1) random variables.

Definition 2.8 Define the random variable $\overline{W}_n$ as

$$\overline{W}_n = \sum_{i=1}^{n-1} E\left[V_i^{(n)}\right] Z_i, \quad (4)$$

where $Z_1, \ldots, Z_{n-1}$ are i.i.d. exp(1) random variables.

Remark 2.9 Despite the apparent elegance, it is not straightforward to derive a Central Limit Theorem (CLT) or limit statements concerning $W_n$ from the representation of Eq. (3). Initially one could hope (based on "typical" results on limits for randomly weighted sums, e.g. Thm. 1 of Rosalsky and Sreehari, 1998) that $W_n$ could converge a.s. to a random variable that has the same limiting distribution as $\overline{W}_n$. Similarly, as in the proof of Thm. 2.4 in Section 5, because $(n+2)(n-1)/(n(n+1)) > 1$, we have that $\overline{W}_n$ is an $L^2$ bounded submartingale

$$E\left[\overline{W}_{n+1} \mid \overline{W}_n\right] = \frac{(n+2)(n-1)}{n(n+1)}\,\overline{W}_n + \frac{2}{n^2(n+1)} > \overline{W}_n.$$

Hence, $\overline{W}_n$ converges almost surely. Figure 1 can easily mislead one to believe in the equality of the limiting distributions of $W_n$ and $\overline{W}_n$. However, in Thm. 6.8 we can see that $\mathrm{Var}[W_n]$ and $\mathrm{Var}[\overline{W}_n]$ converge to different limits. Therefore, $W_n$ and $\overline{W}_n$ cannot converge in distribution to the same limit.
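Because the weights in Definition 2.8 are deterministic, the first two moments of $\overline{W}_n$ can be computed exactly, which gives a quick numerical check of Eq. (1) and of the variance limit quoted from Thm. 6.8. The Python sketch below is an illustration under stated assumptions: it reads Eq. (2) as carrying the factor $1/i$ (the time spent with $i$ lineages being exp($i$) distributed), and all function names are hypothetical. It is not the paper's code.

```python
import math


def coalescence_prob(n: int, k: int) -> float:
    """Eq. (1): chance that a random pair of tips coalesced at the k-th speciation event."""
    return 2.0 * (n + 1) / (n - 1) / ((k + 1) * (k + 2))


def expected_weight(n: int, i: int) -> float:
    """E[V_i^(n)], assuming Eq. (2) reads (1/i) * sum_{k=i}^{n-1} E[1_k^(n) | Y_n]."""
    return sum(coalescence_prob(n, k) for k in range(i, n)) / i


def harmonic(n: int) -> float:
    """H_{n,1} = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def wbar_mean_and_variance(n: int):
    """Exact mean and variance of W-bar_n from Eq. (4).

    The Z_i are i.i.d. exp(1), so each has mean 1 and variance 1; the mean is
    the sum of the weights and the variance is the sum of their squares.
    """
    weights = [expected_weight(n, i) for i in range(1, n)]
    return sum(weights), sum(w * w for w in weights)


n = 500
mean, variance = wbar_mean_and_variance(n)
print("sum of Eq. (1) over k:", sum(coalescence_prob(n, k) for k in range(1, n)))
print("mean:", mean, "vs 2(n - H_{n,1})/(n - 1):", 2.0 * (n - harmonic(n)) / (n - 1))
print("variance:", variance, "vs limit 4*pi^2/3 - 12:", 4.0 * math.pi ** 2 / 3.0 - 12.0)
```

Under this reading the mean of $\overline{W}_n$ equals $2(n - H_{n,1})/(n-1)$, the same value as $E[W_n]$, while the variance approaches $4\pi^2/3 - 12 \approx 1.159$ rather than the $2\pi^2/9 - 1 \approx 1.193$ reported for $W_n$.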
However, as we shall see in Section 4, $\overline{W}_n$ provides a reasonable approximation (and importantly extremely cheap, in terms of computational time and memory) to $W_n$ in the sense of their distributions.

Figure 1: The curves are density estimates, via R's (R Core Team, 2013) density() function, of $W_n$'s density (black) and $\overline{W}_n$'s density (gray). They are based on simulated values of $W_n$ from 10000 simulated 500-tip Yule trees with $\lambda = 1$. To obtain a sample from $\overline{W}_n$, independent exp(1) random variables were drawn. The simulated sample of $W_n$ has mean 2, variance 1.214, skewness 1.609 and excess kurtosis 4.237 while the simulated sample of $\overline{W}_n$ has mean 1.973, variance 1.109, skewness 1.634 and excess kurtosis 4.159. It is obvious that $E[W_n] = E[\overline{W}_n]$, but we have shown that their variances differ (simulations agree with Thm. 6.8).

Definition 2.10 We naturally define the scaled discrete cophenetic index as

$$\tilde{W}_n = \binom{n}{2}^{-1} \tilde{\Phi}^{(n)}. \quad (5)$$

Theorem 2.11 $\tilde{W}_n$ is an almost surely and $L^2$ convergent submartingale.

The applied reader will be most interested in how the results here can be practically used. As already noted in the Introduction, balance indices are often used to provide a single-number summary of the tree's shape. Such statistics can then be used e.g. to test if the tree is consistent with some null model (here the Yule model). Naturally, there has been extensive work on using different balance indices for significance testing (e.g. Agapow and Purvis, 2002; Blum and François, 2005; Yang et al., 2017). However, previous works nearly always used indices that only considered the topology and often obtained the rejection regions through direct simulations.

Unfortunately, looking only at the tree's topology will not allow for distinguishing between some models. In particular (as seen in Tab. 3) there is no difference (from the topological indices perspective) between a Yule tree, a constant rate birth-death tree and a coalescent tree.
Hence, a temporal index that also takes into account the branch lengths should be used (as indicated in the "Work remaining" section at the end of Ch. 33 in Felsenstein, 2004). A statistic based on $\Phi^{(n)}$ performs significantly better (but in these cases still leaves a lot to be desired). However, $\Phi^{(n)}$ shows its true usefulness when employed to distinguish a biased speciation model (Blum and François, 2005) from a Yule model. Blum and François (2005) indicated that there is a regime where topological indices fail completely. Table 3 shows that in this setup (and also certain others) the temporal index is superior in recognizing the deviation from the Yule tree.

Directly simulating a tree from a null model (Yule here) and then calculating the index will of course give a sample from the correct null distribution. However, this approach is costly both in terms of time and memory. Therefore, if theoretical results that provide equivalent, asymptotic or approximate representations of the index's law are available they could speed up any study by orders of magnitude. In fact this is clearly visible in Tab. 1: calculating the cophenetic index directly from a sample of simulated pure-birth trees is over 170 times slower than considering $\overline{W}_n$. Even more dramatically, one can obtain a sample from an approximation to the equivalent representation of the asymptotic distribution of $\tilde{\Phi}^{(n)}$ (after normalization) nearly 3000 times faster than directly sampling the discrete cophenetic index.

In Alg. 1 we describe how the approach presented here can be used for significance testing. Then, in Section 4 we discuss in detail the required computational procedures, present simulation results concerning the power of the tests and apply the tests to empirical data. Preceding this computational Section is a characterization of the limit distribution of (normalized) $\tilde{\Phi}^{(n)}$ and another proposal to approximate the limit of (normalized) $\Phi^{(n)}$.
This section justifies the simulation algorithms described in Section 4.

Algorithm 1 Significance testing
    input: A phylogenetic tree $T^{(n)}$ with $n$ tips and significance level $\alpha$
    output: A decision if the null hypothesis of $T^{(n)}$ coming from a pure-birth process is rejected (TRUE) or not (FALSE)
    Correct, when necessary, the tree for the speciation rate, by multiplying all branch lengths by $\lambda$, if cophenetic index with branch lengths is used.    ▷ See Section 4.2.
    Calculate $\Phi$, $T^{(n)}$'s cophenetic index    ▷ Exactly which version is used, $\Phi^{(n)}$, $\Phi^{(n)}_{NRE}$, $\tilde{\Phi}^{(n)}$, $\tilde{\Phi}^{(n)}_{NRE}$, depends on the particular tree, if it has branch lengths or root edge
    Standardize $\Phi$ as $X = (\Phi - E_{Yule}[\Phi]) / \sqrt{\mathrm{Var}_{Yule}[\Phi]}$.    ▷ $E_{Yule}[\Phi]$ and $\mathrm{Var}_{Yule}[\Phi]$ depend on which version of the cophenetic index is considered. In Thm. 2.12 all the possibilities are presented.
    Obtain the quantiles $q_{Yule}(\alpha/2)$, $q_{Yule}(1 - \alpha/2)$ (if test is two-sided), $q_{Yule}(\alpha)$ (left-tailed), $q_{Yule}(1 - \alpha)$ (right-tailed) of $X$ under the Yule model, i.e. $P_{Yule}(X \le q_{Yule}(\alpha)) = \alpha$.    ▷ Exactly how to obtain the quantiles is a matter of which version of the cophenetic index is used and computational resources (see Section 4).
    if $X$ is inside rejection region then
        return TRUE
    else
        return FALSE
    end if

Theorem 2.12 A random variable with subscript NRE (no root-edge) indicates that this random variable comes from a tree lacking the edge leading to the root.
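$$
\begin{aligned}
E\left[\Phi^{(n)}\right] &= \binom{n}{2} \frac{2(n - H_{n,1})}{n-1} \sim n^2, \\
E\left[\Phi^{(n)}_{NRE}\right] &= \binom{n}{2} \left(\frac{2(n - H_{n,1})}{n-1} - 1\right) \sim \tfrac{1}{2} n^2, \\
\mathrm{Var}\left[\Phi^{(n)}\right] &= \binom{n}{2}^2 \frac{1}{9 n^2 (n-1)^2} \Big(12 n^2 (n^2 - 6n - 4) H_{n-1,2} - 9 n^4 + 102 n^3 \\
&\qquad + 51 n^2 - 24 n H_{n-1,1} - 72 n - 72\Big) \sim \frac{2\pi^2 - 9}{36}\, n^4, \\
\mathrm{Var}\left[\Phi^{(n)}_{NRE}\right] &= \binom{n}{2}^2 \bigg(\frac{1}{9 n^2 (n-1)^2} \Big(12 n^2 (n^2 - 6n - 4) H_{n-1,2} - 9 n^4 + 102 n^3 \\
&\qquad + 51 n^2 - 24 n H_{n-1,1} - 72 n - 72\Big) - 1\bigg) \sim \frac{2\pi^2 - 18}{36}\, n^4,
\end{aligned}
\quad (6)
$$

$$
\begin{aligned}
E\left[\tilde{\Phi}^{(n)}\right] &= \binom{n}{2} \left(\frac{4(n - H_{n,1})}{n-1} - 1\right) \sim \tfrac{3}{2} n^2, \\
E\left[\tilde{\Phi}^{(n)}_{NRE}\right] &= \binom{n}{2} \left(\frac{4(n - H_{n,1})}{n-1} - 2\right) \sim n^2, \\
\mathrm{Var}\left[\tilde{\Phi}^{(n)}\right] &= \tfrac{1}{12}\left(n^4 - 10 n^3 + 131 n^2 - 2n\right) - 4 n^2 H_{n,2} - 6 n H_{n,1} \sim \tfrac{1}{12} n^4, \\
\mathrm{Var}\left[\tilde{\Phi}^{(n)}_{NRE}\right] &= \tfrac{1}{12}\left(n^4 - 10 n^3 + 131 n^2 - 2n\right) - 4 n^2 H_{n,2} - 6 n H_{n,1} \sim \tfrac{1}{12} n^4.
\end{aligned}
\quad (7)
$$

Proof The proof of the expectation part is due to Mir et al. (2013); Sagitov and Bartoszek (2012). The variance of $\tilde{\Phi}^{(n)}$ is due to Cardona et al. (2013); Mir et al. (2013). The variance of $\Phi^{(n)}$ is a consequence of the lemmata and theorems presented in Section 6. When the root edge is not included, then we have to decrease the expectation by $\binom{n}{2}$. This is due to each pair of tips "having" the root edge included in the cophenetic distance between them. In the case of branch lengths, the expectation of the root edge, distributed as exp(1), is one. Without a root edge for the same reason the variance of $\Phi^{(n)}$ has to be decreased by $\binom{n}{2}^2$. In the discrete case the root edge has a deterministic length of 1 and hence no effect on the variance.

3 Contraction-type limit distribution

Even though the representation of Eq. (3) is a very elegant one, it is not obvious how to derive asymptotic properties of the process from it (compare Section 6). We turn to considering the recursive representation proposed by Mir et al. (2013)

$$\tilde{\Phi}^{(n)}_{NRE} = \tilde{\Phi}^{(L_n)}_{NRE} + \tilde{\Phi}^{(R_n)}_{NRE} + \binom{L_n}{2} + \binom{R_n}{2}, \quad (8)$$

where $L_n$ and $R_n$ are the number of left and right daughter tip descendants. Obviously $L_n + R_n = n$.

From Eq. (8) we will be able to deduce the form of the limit of the process.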
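The recursion of Eq. (8) can be simulated directly, which gives a cheap check of the discrete no-root-edge expectation in Thm. 2.12. The Python sketch below is illustrative only. It assumes that under the Yule model the number of tips in the left daughter clade, $L_n$, is uniform on $\{1, \ldots, n-1\}$, a standard property of Yule trees that is not stated explicitly in the text. The function names are hypothetical.

```python
import random
from math import comb


def simulate_discrete_index(n: int, rng: random.Random) -> int:
    """One draw of the discrete no-root-edge cophenetic index via Eq. (8)."""
    if n < 2:
        return 0  # a single tip contributes no pairs
    left = rng.randint(1, n - 1)  # assumption: uniform root split under the Yule model
    right = n - left
    return (simulate_discrete_index(left, rng)
            + simulate_discrete_index(right, rng)
            + comb(left, 2) + comb(right, 2))


def expected_discrete_index(n: int) -> float:
    """Expectation of the discrete no-root-edge index from Thm. 2.12."""
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    return comb(n, 2) * (4.0 * (n - harmonic) / (n - 1) - 2.0)


rng = random.Random(2018)
n_tips, n_draws = 30, 4000
draws = [simulate_discrete_index(n_tips, rng) for _ in range(n_draws)]
print("simulated mean:", sum(draws) / n_draws)
print("Thm. 2.12 mean:", expected_discrete_index(n_tips))
```

For three tips the recursion always returns 1, since every three-tip binary tree has the same shape, and for four tips it returns 2 (balanced) or 4 (caterpillar), with mean $10/3$ in agreement with the formula.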
In the case with branch lengths we attempt to approximate the cophenetic index with the following contraction-type law

$$\Phi^{(n)}_{NRE} = \Phi^{(L_n)}_{NRE} + \Phi^{(R_n)}_{NRE} + \binom{L_n}{2} T_{0.5} + \binom{R_n}{2} T'_{0.5}, \quad (9)$$

where $T_{0.5}$, $T'_{0.5}$ are independent exp(2) random variables (we index with the mean to avoid confusion with $T_2$, Section 5, the time between the second and third speciation event which is also exp(2) distributed). These are the branch lengths leading from the speciation point. The rationale behind the choice of distribution is that a randomly chosen internal branch of a conditioned Yule tree with rate 1 is exp(2) distributed (Cor. 3.2 and Thm. 3.3 Stadler and Steel, 2012). This is of course an approximation, as we cannot expect that the laws of the branch lengths with the depth of the recursion should become indistinguishable from the law of the average branch. In fact, we should expect that the law of Eq. (9) has to depend on $n$, i.e. the level of the recursion. For larger $n$ the branches have distributions concentrated on smaller values, e.g. compare the randomly sampled root adjacent branch length law (Thm. 5.1 Stadler and Steel, 2012) with the law of the average branch length. However, as we shall see, simulations indicate that approximating with the average law could still yield acceptable heuristics, but not as good as the approximation by $\overline{W}_n$. We use the notation $\Phi^{(n)}_{NRE}$, $\tilde{\Phi}^{(n)}_{NRE}$ to differentiate from $\Phi^{(n)}$, $\tilde{\Phi}^{(n)}$ where the root branch is included, i.e.

$$\tilde{\Phi}^{(n)} = \tilde{\Phi}^{(n)}_{NRE} + \binom{n}{2} \quad \text{and} \quad \Phi^{(n)} = \Phi^{(n)}_{NRE} + \binom{n}{2} T_1, \quad \text{where } T_1 \sim \exp(1).$$

Define now

$$Y^{(n)} = n^{-2}\left(\Phi^{(n)}_{NRE} - E\left[\Phi^{(n)}_{NRE}\right]\right), \qquad \tilde{Y}^{(n)} = n^{-2}\left(\tilde{\Phi}^{(n)}_{NRE} - E\left[\tilde{\Phi}^{(n)}_{NRE}\right]\right)$$

and using Eqs.
(6) and (7) we obtain the following recursions

$$
\begin{aligned}
Y^{(n)} &= \left(\frac{L_n}{n}\right)^2 Y^{(L_n)} + \left(\frac{R_n}{n}\right)^2 Y^{(R_n)} + n^{-2}\binom{L_n}{2} T_{0.5} + n^{-2}\binom{R_n}{2} T'_{0.5} \\
&\quad + n^{-2}\left(E\left[\Phi^{(L_n)}_{NRE} \mid L_n\right] + E\left[\Phi^{(R_n)}_{NRE} \mid R_n\right] - E\left[\Phi^{(n)}_{NRE}\right]\right)
\end{aligned}
$$

and

$$
\begin{aligned}
\tilde{Y}^{(n)} &= \left(\frac{L_n}{n}\right)^2 \tilde{Y}^{(L_n)} + \left(\frac{R_n}{n}\right)^2 \tilde{Y}^{(R_n)} + n^{-2}\binom{L_n}{2} + n^{-2}\binom{R_n}{2} \\
&\quad + n^{-2}\left(E\left[\tilde{\Phi}^{(L_n)}_{NRE} \mid L_n\right] + E\left[\tilde{\Phi}^{(R_n)}_{NRE} \mid R_n\right] - E\left[\tilde{\Phi}^{(n)}_{NRE}\right]\right).
\end{aligned}
$$

The process $\tilde{Y}^{(n)}$ is related to the process $\tilde{W}_n$ as

$$\tilde{W}_n = 2(1 + n^{-1})\,\tilde{Y}^{(n)} + \binom{n}{2}^{-1} E\left[\tilde{\Phi}^{(n)}_{NRE}\right].$$

In the continuous case we do not have an exact equality, we rather hope for

$$W_n \approx 2(1 + n^{-1})\,Y^{(n)} + \binom{n}{2}^{-1} E\left[\Phi^{(n)}_{NRE}\right] + T_1$$

in some sense of approximation. Hence, knowledge of the asymptotic behaviour of $Y^{(\infty)}$, $\tilde{Y}^{(\infty)}$ will immediately give us information about $W^{(\infty)}$, $\tilde{W}^{(\infty)}$ in the obvious way

$$\tilde{W}^{(\infty)} = 2\tilde{Y}^{(\infty)} + 2, \qquad W^{(\infty)} \approx 2Y^{(\infty)} + 1 + T_1.$$

The processes $Y^{(n)}$, $\tilde{Y}^{(n)}$ look very similar to the scaled recursive representation of the Quicksort algorithm (e.g. Rösler, 1991). In fact, it is of interest that, just as in the present work, a martingale proof first showed convergence of Quicksort (Régnier, 1989), but then a recursive approach is required to show properties of the limit. The random variable $L_n/n \to \tau \sim \mathrm{Unif}[0, 1]$ weakly and as weak convergence is preserved under continuous transformations (Thm. 18, p. 316 Grimmett and Stirzaker, 2009) we will have $(L_n/n)^2 \to \tau^2$ weakly. Therefore, we would expect the almost sure limits to satisfy the following equalities in distribution (remembering the asymptotic behaviour of the expectations)

$$Y^{(\infty)} = \tau^2\, Y'^{(\infty)} + (1-\tau)^2\, Y''^{(\infty)} + \tfrac{1}{2}\tau^2 T_{0.5} + \tfrac{1}{2}(1-\tau)^2 T'_{0.5} - \tau(1-\tau), \quad (10)$$

and

$$\tilde{Y}^{(\infty)} = \tau^2\, \tilde{Y}'^{(\infty)} + (1-\tau)^2\, \tilde{Y}''^{(\infty)} + \tfrac{1}{2} - 3\tau(1-\tau), \quad (11)$$

where $\tau$ is uniformly distributed on $[0, 1]$, $Y^{(\infty)}$, $Y'^{(\infty)}$ and $Y''^{(\infty)}$ are identically distributed random variables, so are $\tilde{Y}^{(\infty)}$, $\tilde{Y}'^{(\infty)}$ and $\tilde{Y}''^{(\infty)}$, and $Y'^{(\infty)}$, $Y''^{(\infty)}$, $\tilde{Y}'^{(\infty)}$ and $\tilde{Y}''^{(\infty)}$ are independent. Following Rösler (1991)'s approach it turns out that the limiting distributions do satisfy the equalities of Eqs.
(10) and (11). Let $D$ be the space of distributions with zero first moment and finite second moment. We consider on $D$ the Wasserstein metric

$$d(F, G) = \inf_{X \sim F,\, Y \sim G} \lVert X - Y \rVert_2.$$

Theorem 3.1 Let $F \in D$ and assume that $Y, Y' \sim F$, $\tau \sim \mathrm{Unif}[0, 1]$, $T_{0.5}, T'_{0.5} \sim \exp(2)$ and $Y, Y', \tau, T_{0.5}, T'_{0.5}$ are all independent. Define transformations $S_1 : D \to D$, $S_2 : D \to D$ as

$$S_1(F) = \mathcal{L}\left(\tau^2 Y + (1-\tau)^2 Y' + \tfrac{1}{2}\tau^2 T_{0.5} + \tfrac{1}{2}(1-\tau)^2 T'_{0.5} - \tau(1-\tau)\right), \quad (12)$$

and

$$S_2(F) = \mathcal{L}\left(\tau^2 Y + (1-\tau)^2 Y' + \tfrac{1}{2} - 3\tau(1-\tau)\right) \quad (13)$$

respectively. Both transformations $S_1$ and $S_2$ are contractions on $(D, d)$ and converge exponentially fast in the $d$-metric to the unique fixed points of $S_1$ and $S_2$ respectively.

Remark 3.2 The proof of Thm. 3.1 is the same as Rösler (1991)'s proof of his Thm. 2.1. However, compared to the Quicksort algorithm (Rösler, 1991) we will have a $\sqrt{2/5}$ upper bound on the rate of decay instead of $\sqrt{2/3}$. This speed-up should be expected as we have $\tau^2$ and $(1-\tau)^2$ instead of $\tau$ and $(1-\tau)$. Thm. 3.1 can also be seen as a consequence of Rösler (1992)'s more general Thms. 3 and 4. The rate of convergence is also a consequence of the general contraction lemma (Lemma 1, Rösler and Rüschendorf, 2001).

Now, using Lemmata 7.1, 7.2 (their proofs in 7.2 differ only in detail from the proof of Prop. 3.2 in Rösler, 1991) and arguing in the same way as Rösler (1991) did in his Section 3, especially his proof of his Thm. 3.1, we obtain that $Y^{(n)}$ and $\tilde{Y}^{(n)}$ converge in the Wasserstein $d$-metric to $Y^{(\infty)}$ and $\tilde{Y}^{(\infty)}$ whose laws are fixed points of $S_1$ and $S_2$ respectively. A minor point should be made. Here, we will have $(i/n)^4$ instead of $(i/n)^2$ in a counterpart of Rösler (1991)'s Prop 3.3.

Remark 3.3 One may directly obtain from the recursive representation that $E[Y^{(\infty)}] = E[\tilde{Y}^{(\infty)}] = 0$, $\mathrm{Var}[Y^{(\infty)}] = 1/16 = 0.0625$ and $\mathrm{Var}[\tilde{Y}^{(\infty)}] = 1/12$. We can therefore see that in the discrete case the variance agrees. However, in the continuous case we can see that it slightly differs

$$\mathrm{Var}\left[(W_n - T_1)/2\right] \to \pi^2/18 - 0.5 \approx 0.048.$$

Remark 3.4 One can of course calculate what the mean and variance of $T_{0.5}$, $T'_{0.5}$ should be so that $E[Y^{(\infty)}] = 0$ and $\mathrm{Var}[Y^{(\infty)}] = \mathrm{Var}[(W_n - T_1)/2]$. We should have $E[T_{0.5}] = E[T'_{0.5}] = 0.5$ and $\mathrm{Var}[T_{0.5}] = \mathrm{Var}[T'_{0.5}] = \pi^2/3 - 25/8$. This, in particular, means that these branch lengths cannot be exponentially distributed. We therefore also experimented by drawing $T_{0.5}$, $T'_{0.5}$ from a gamma distribution with rate equalling $1/(2(\pi^2/3 - 25/8))$ and shape equalling $\pi^2/6 - 25/16$. However, this significantly increased the duration of the computations but did not result in any visible improvements in comparison to Tab. 1.

4 Significance testing

4.1 Obtaining the quantiles

Algorithm 1 requires knowledge of the quantiles of the underlying distribution in order to define the rejection region. Unfortunately, an analytical form of the density of any scaled cophenetic index is not known so one will have to resort to some sort of simulations to obtain the critical values. Directly simulating a large number of pure-birth trees can take an overly long time, measured in minutes (on a modern machine with a large amount of memory, or hours on an older one). Fortunately, the cophenetic index can be calculated in $O(n)$ time (Cor. 3 Mir et al., 2013) and such a tree-traversing algorithm was employed to obtain $\Phi^{(n)}$ and $\tilde{\Phi}^{(n)}$. On the other hand, the suggestive (but wrong) approximations of Eq. (4) and contraction limiting distributions Eqs. (10) and (11) are significantly faster to simulate, see Tab. 1.

Simulating from the approximate Eq. (4) is straightforward. One simply draws $n - 1$ independent exp(1) random variables. Simulating random variables satisfying Eqs. (10) and (11) is more involved and it may be possible to develop an exact rejection algorithm (cf. Devroye et al., 2000). Here, we choose simple, approximate but still effective, heuristics in order to demonstrate the usefulness of the approach for significance testing.
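Drawing from the approximation of Eq. (4) and turning the draws into critical values for Alg. 1 can be sketched as follows. This Python sketch is an illustration, not the paper's R code. The closed form used for the weights, $E[V_i^{(n)}] = 2(n-i)/((n-1)\,i\,(i+1))$, is an assumption obtained here by summing Eq. (1) inside Eq. (2) (read with the factor $1/i$). The centring and the asymptotic scaling are those listed for $\overline{W}_n$ in Tab. 2. The function names and the chosen sample sizes are hypothetical.

```python
import math
import random


def wbar_weights(n: int):
    """Assumed closed form of E[V_i^(n)], i = 1, ..., n - 1 (see lead-in)."""
    return [2.0 * (n - i) / ((n - 1) * i * (i + 1)) for i in range(1, n)]


def sample_standardized_wbar(n: int, size: int, rng: random.Random):
    """Draw `size` values of W-bar_n (Eq. (4)), centred and scaled as in Tab. 2."""
    weights = wbar_weights(n)
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    centre = 2.0 * (n - harmonic) / (n - 1)
    scale = math.sqrt((2.0 * math.pi ** 2 - 9.0) / 9.0)  # asymptotic scaling
    draws = []
    for _ in range(size):
        # n - 1 independent exp(1) variables, weighted and summed
        w = sum(a * rng.expovariate(1.0) for a in weights)
        draws.append((w - centre) / scale)
    return draws


def empirical_quantile(sample, alpha: float) -> float:
    """Simple order-statistic quantile: smallest value with ECDF >= alpha."""
    ordered = sorted(sample)
    index = min(len(ordered) - 1, max(0, math.ceil(alpha * len(ordered)) - 1))
    return ordered[index]


def reject_yule(x: float, null_sample, alpha: float = 0.05) -> bool:
    """Two-sided decision of Alg. 1 for a standardized index value x."""
    lower = empirical_quantile(null_sample, alpha / 2.0)
    upper = empirical_quantile(null_sample, 1.0 - alpha / 2.0)
    return x < lower or x > upper


rng = random.Random(1)
null_sample = sample_standardized_wbar(n=100, size=3000, rng=rng)
print("q(0.025):", empirical_quantile(null_sample, 0.025))
print("q(0.975):", empirical_quantile(null_sample, 0.975))
```

The lower and upper critical values come out clearly asymmetric, unlike the Gaussian $\mp 1.96$, which reflects the right-skewed, heavy-tailed shape discussed in the Introduction. Larger `n` and `size` values (the paper reports $n = 500$ and 10000 draws) only change the running time, not the structure of the sketch.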
We now describe algorithms (Algs. 2 and 3) for simulating from a more general distribution, $F$, that satisfies

$$Y = g_1(\tau) Y' + g_2(\tau) Y'' + C(\tau, \theta), \quad (14)$$

where $Y, Y', Y'' \sim F$, $Y', Y'', \tau, \theta$ are independent, $\tau \sim F_\tau$, $\theta \sim F_\theta$ is some random vector, $g_1, g_2 : \mathbb{R} \to \mathbb{R}$ and $C : \mathbb{R}^p \to \mathbb{R}$ for some appropriate $p$ that depends on $\theta$'s dimension. Of course in our case here we have $\tau \sim \mathrm{Unif}[0, 1]$, $g_1(\tau) = \tau^2$, $g_2(\tau) = (1-\tau)^2$,

$$C(\tau, T, T') = \tau^2 T/2 + (1-\tau)^2 T'/2 - \tau(1-\tau) \quad \text{and} \quad \tilde{C}(\tau) = 1/2 - 3\tau(1-\tau)$$

for $\Phi^{(n)}$, $\tilde{\Phi}^{(n)}$ respectively. Of course, $T$, $T'$ are independent and exp(2) distributed. If one considers also the root edge, then to the simulated random variable one needs to add $T_1 \sim \exp(1)$ when simulating $n^{-2}\Phi^{(n)}$ or appropriately 1 if one considers $n^{-2}\tilde{\Phi}^{(n)}$. The recursion of Alg. 3 for a given realization of $\tau$ and $\theta$ random variables can be directly solved. However, from numerical experiments implementing Alg. 3 iteratively seemed computationally ineffective.

Algorithm 2 Population approximation
    Initiate population size $N$
    Set $P[0, 1{:}N] = Y_0$    ▷ Initial population
    for $i = 1$ to $i_{\max}$ do
        $f_{i-1}$ = density($P[i-1, ]$)    ▷ density estimation by R
        for $j = 1$ to $N$ do
            Draw $\tau$ from $F_\tau$
            Draw $\theta$ from $F_\theta$
            Draw $Y_1$, $Y_2$ independently from $f_{i-1}$
            $P[i, j] = g_1(\tau) Y_1 + g_2(\tau) Y_2 + C(\tau, \theta)$
        end for
    end for
    return $P[i_{\max}, ]$
    ▷ Add root branch (exp(1) or 1) if needed for each individual.

Algorithm 3 Recursive approximation
    procedure Yrecursion($n$, $Y_0$)
        if $n = 0$ then
            $Y_1 = Y_0$, $Y_2 = Y_0$
        else if $n = 1$ and $Y_0 = 0$ then
            Draw $\tau_1$, $\tau_2$ independently from $F_\tau$
            Draw $\theta_1$, $\theta_2$ independently from $F_\theta$
            $Y_1 = C(\tau_1, \theta_1)$
            $Y_2 = C(\tau_2, \theta_2)$
        else
            $Y_1$ = Yrecursion($n - 1$, $Y_0$)
            $Y_2$ = Yrecursion($n - 1$, $Y_0$)
        end if
        Draw $\tau$ from $F_\tau$
        Draw $\theta$ from $F_\theta$
        return $g_1(\tau) Y_1 + g_2(\tau) Y_2 + C(\tau, \theta)$
    end procedure
    return Yrecursion($N$, $Y_0$)
    ▷ Add root branch (exp(1) or 1) if needed.

In Tab. 1 we report on the simulations from the different distributions. For each distribution we draw a sample of size 10000 and repeat this 100 times. We compare the quantiles from the different distributions. We can see that the approximation of $W_n$ by $\overline{W}_n$ is a good one and can be used when one needs to work with the distribution of the cophenetic index with branch lengths. In the case of the discrete cophenetic index we have found an exact limit distribution which is a contraction-type distribution. Therefore, one can relatively quickly simulate a sample from it without the need to do lengthy simulations of the whole tree and then calculations of the cophenetic index.

Unfortunately, this contraction approach does not seem to give such good results in the Yule tree with branch lengths case. We used an approximation when constructing the contraction. Instead of taking the law of the length of two daughter branches, we took the law of a random internal branch. This induces a difference between the tails of the distributions that is clearly visible in the simulations. Even at the second moment level there is a difference. We calculated (Thm. 6.8) that $\mathrm{Var}[W_n] \to 2\pi^2/9 - 1 \approx 1.193$, $\mathrm{Var}[\overline{W}_n] \to 4\pi^2/3 - 12 \approx 1.159$ while $\mathrm{Var}[2Y^{(\infty)} + T_1] = 1.25$. Therefore, the approximation by $\overline{W}_n$ seems better already at the second moment level. Generally if one cannot afford the time and memory to simulate a large sample of Yule trees, simulating $\overline{W}_n$ values seems an attractive option, as the discrepancy between the two distributions seems small.

In Fig. 2 we compare the density estimates of (scaled and centred) both continuous and discrete branches cophenetic indices and their respective contraction-type limit distributions. The density estimates generally agree but we know from Tab. 1 that for $\Phi^{(n)}$ this is only an approximation. We simulated 10000 Yule trees and hence we report only the quantiles between 2.5% and 97.5%. Quantiles further out in the tails seemed less accurate and hence are not included in the table. Similarly, we can see less correspondence between the different estimates of kurtosis. This statistic relies on fourth moments and hence is more sensitive to the tails. On the other hand we can see much greater Monte Carlo error for the kurtosis in all simulations, including the setup where the values are extracted directly from Yule trees. The values for $\tilde{\Phi}^{(n)}$ seem more similar to values from the Yule tree. We should expect this as here we have shown an exact limit distribution.

An overall assessment of the quantiles is given by the root-mean-square-error (RMSE) row in Tab. 1. We consider the quantiles at the $\alpha_1 = 0.001$, $\alpha_2 = 0.005$, $\alpha_3 = 0.01$, $\alpha_4 = 0.025$, $\alpha_5 = 0.05$, $\alpha_6 = 0.95$, $\alpha_7 = 0.975$, $\alpha_8 = 0.99$, $\alpha_9 = 0.995$, $\alpha_{10} = 0.999$ levels. The RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{100} \sum_{j=1}^{100} \left(\sum_{i=1}^{5} (\alpha_i - \alpha_{i-1}) \left(\hat{q}_j(\alpha_i) - q(\alpha_i)\right)^2 + \sum_{i=6}^{10} (\alpha_{i+1} - \alpha_i) \left(\hat{q}_j(\alpha_i) - q(\alpha_i)\right)^2\right) \cdot (0.1)^{-1}} \quad (15)$$

with dummy levels $\alpha_0 = 0$ and $\alpha_{11} = 1$. The $(0.1)^{-1}$ normalizes the whole mean-square-error. We only look at the error at the tails, so we correct by the fraction of the distributions' support that we consider. As a proxy for the true quantiles we take the pooled values (as explained in Tab. 1) from the "Yule columns". The $j$ index runs over the 100 repeats of the simulations.

The RMSE, when using $\overline{W}_n$, seems to be on the level of the RMSE of the "direct simulations". $Y^{(N)}$ has an error of about twice the size (both simulation methods). Looking at $\tilde{Y}^{(N)}$ one can see that the RMSE is exactly on the level of the "Yule column's" RMSE.
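The recursive heuristic of Alg. 3, specialised to the discrete limit of Eq. (11), can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the shortcut branch for $n = 1$ with $Y_0 = 0$ is folded into the general recursion (it yields the same draw), and the closed-form variance at finite depth is a short calculation made here from $E[\tilde{C}(\tau)^2] = 1/20$ and $2E[\tau^4] = 2/5$. That formula is not stated in the paper.

```python
import random


def contraction_term(tau: float) -> float:
    """C~(tau) = 1/2 - 3 tau (1 - tau), the additive term of Eq. (11)."""
    return 0.5 - 3.0 * tau * (1.0 - tau)


def y_recursion(depth: int, rng: random.Random, y0: float = 0.0) -> float:
    """One draw from the depth-limited recursion
    Y = tau^2 Y' + (1 - tau)^2 Y'' + C~(tau), started from the constant y0."""
    if depth == 0:
        y1 = y2 = y0
    else:
        y1 = y_recursion(depth - 1, rng, y0)
        y2 = y_recursion(depth - 1, rng, y0)
    tau = rng.random()  # tau ~ Unif[0, 1]
    return tau ** 2 * y1 + (1.0 - tau) ** 2 * y2 + contraction_term(tau)


def finite_depth_variance(depth: int) -> float:
    """Variance of y_recursion(depth, ...) for y0 = 0 (derived here, see lead-in):
    (1/20) * sum_{k=0}^{depth} (2/5)^k = (1/12) * (1 - (2/5)^(depth + 1))."""
    return (1.0 - 0.4 ** (depth + 1)) / 12.0


rng = random.Random(42)
sample = [y_recursion(6, rng) for _ in range(4000)]
mean = sum(sample) / len(sample)
variance = sum((y - mean) ** 2 for y in sample) / (len(sample) - 1)
print("sample mean:", mean)
print("sample variance:", variance, "finite-depth value:", finite_depth_variance(6))
print("depth 10 vs limit 1/12:", finite_depth_variance(10), 1.0 / 12.0)
```

Under these assumptions the gap between the finite-depth variance and the limiting value $1/12$ of Remark 3.3 shrinks by a factor of $2/5$ with each extra level, so at depth 10 it is already of order $10^{-5}$ relative to the limit. Each draw costs $2^{N+1} - 1$ uniform variates for depth $N$.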
This is even though we used a recursion of level 10, while an exact match of distributions should take place in the limit (infinite depth recursion). However, the rapid, exponential convergence of the contraction seems to make any differences invisible, already at this recursion level.

Table 1: Simulations based on 100 independent repeats of 10000 independent draws of each random variable (population size for Alg. 2), i.e. columns, bar $N(0, 1)$. In each cell the first line gives the minimum and the maximum observed from the 100 repeats and the "pooled" line gives the value from pooling all repeats together. The running times are averages of 100 independent repeats with 10000 draws each. The abbreviations in the row names are for average (Avg.), variance (Var.), skewness (Skew.), excess kurtosis (Ex. kurt.) and root-mean-square-error (RMSE). The rows $q(\alpha)$ correspond to the, simulated, bar $N(0, 1)$, quantiles, i.e. for a random variable $X$, $P(X \le q(\alpha)) = \alpha$. All simulations were done in R with the package TreeSim (Stadler, 2009, 2011) used to obtain the Yule trees with speciation rate $\lambda = 1$, $n = 500$ tips and a root edge. The Yule tree $\Phi^{(n)}$, $\tilde{\Phi}^{(n)}$ values are centred and scaled by expectation and standard deviation from Eqs. (6) and (7). Other centrings and scalings are summarized in Tab. 2. $N = 10$ for Algs. 2 and 3 is the number of generations and recursion depth of the respective algorithm. In Alg. 2 the initial population is set at 0 and also $Y_0 = 0$ for Alg. 3. The simulations were run in R 3.4.2 for openSUSE 42.3 (x86_64) on a 3.50GHz Intel Xeon CPU E5-1620 v4. The calculation of the RMSE is described in the text next to Eq. (15).

$\mathrm{Var}[\Phi^{(n)}]^{-1/2}\,(\Phi^{(n)} - E[\Phi^{(n)}])$ and its limit approximations:

              Yule              N(0,1)    W̄_n               Y^(N) Alg. 2      Y^(N) Alg. 3
Run time      690.918s          -         3.905s            0.318s            110.021s
Avg. (= 0)    -0.023, 0.029     0         -0.025, 0.024     -0.024, 0.026     -0.019, 0.026
  pooled      0.002             0         0.001             0.006             0
Var. (= 1)    0.946, 1.074      1         0.921, 1.025      0.928, 1.072      0.932, 1.061
  pooled      1.003             1         0.97              1.014             1
Skew.         1.480, 1.834      0         1.487, 1.917      1.67, 2.124       1.62, 2.197
  pooled      1.643             0         1.68              1.97              1.858
Ex. kurt.     3.123, 7.222      0         3.148, 6.753      3.690, 8.31       3.5, 9.428
  pooled      4.639             0         4.575             6.377             5.435
q(0.025)      -1.235, -1.194    -1.96     -1.206, -1.174    -1.115, -1.087    -1.114, -1.091
  pooled      -1.215            -1.96     -1.19             -1.1              -1.101
q(0.05)       -1.15, -1.104     -1.644    -1.115, -1.085    -1.048, -1.023    -1.047, -1.024
  pooled      -1.123            -1.644    -1.1              -1.033            -1.036
q(0.95)       1.861, 2.07       1.644     1.82, 2.013       1.863, 2.081      1.873, 2.066
  pooled      1.946             1.644     1.914             1.942             1.969
q(0.975)      2.436, 2.735      1.96      2.436, 2.732      2.434, 2.823      2.536, 2.792
  pooled      2.587             1.96      2.549             2.642             2.634
RMSE          0.053             0.762     0.062             0.11              0.108

$\mathrm{Var}[\tilde{\Phi}^{(n)}]^{-1/2}\,(\tilde{\Phi}^{(n)} - E[\tilde{\Phi}^{(n)}])$ and its limit approximations:

              Yule              N(0,1)    Ỹ^(N) Alg. 2      Ỹ^(N) Alg. 3
Run time      698.269s          -         0.233s            44.358s
Avg. (= 0)    -0.02, 0.025      0         -0.033, 0.032     -0.028, 0.02
  pooled      0                 0         0.001             0
Var. (= 1)    0.939, 1.038      1         0.953, 1.087      0.931, 1.047
  pooled      1                 1         1.012             1.001
Skew.         1.138, 1.368      0         1.163, 1.368      1.159, 1.352
  pooled      1.245             0         1.25              1.253
Ex. kurt.     1.374, 2.853      0         1.392, 2.707      1.481, 2.88
  pooled      1.95              0         1.989             2
q(0.025)      -1.257, -1.226    -1.96     -1.276, -1.239    -1.266, -1.23
  pooled      -1.245            -1.96     -1.257            -1.246
q(0.05)       -1.18, -1.146     -1.644    -1.184, -1.154    -1.175, -1.146
  pooled      -1.162            -1.644    -1.17             -1.161
q(0.95)       1.844, 2.024      1.644     1.883, 2.066      1.856, 2.034
  pooled      1.949             1.644     1.958             1.949
q(0.975)      2.328, 2.607      1.96      2.383, 2.645      2.360, 2.62
  pooled      2.486             1.96      2.5               2.488
RMSE          0.040             0.663     0.048             0.040

Table 2: Centrings and scalings applied to obtain the entries in Tab. 1. For a random variable $X$ by its centred and scaled version we mean $(X - \mu)/\sigma$. These centrings and scalings are required to obtain mean zero, variance 1 versions of the random variables, i.e. so that they have the same location and scale as the z-transformed cophenetic index. In case of $\overline{W}_n$ we take the asymptotic scaling (Thm. 6.8) to be comparable with $Y^{(N)}$. For the convenience of the reader we also provide corresponding centrings and scalings in the no root-edge setup (not considered in Tab. 1).

With root edge:
                      W̄_n                               Y^(N)                  Ỹ^(N)
Centring ($\mu$)      $2(n - H_{n,1})/(n - 1)$          1                      1
Scaling ($\sigma$)    $\sqrt{(2\pi^2 - 9)/9}$           $\sqrt{1 + 1/16}$      $(\sqrt{12})^{-1}$

No root edge (NRE):
                      W̄_n (NRE)                         Y^(N) (NRE)            Ỹ^(N) (NRE)
Centring ($\mu$)      $(n + 1 - 2H_{n,1})/(n - 1)$      0                      0
Scaling ($\sigma$)    $\sqrt{(2\pi^2 - 18)/9}$          $1/4$                  $(\sqrt{12})^{-1}$

Figure 2: Density estimates of scaled (by theoretical standard deviation) and centred (by theoretical expectation) cophenetic indices (black) from 10000 simulated 500-tip Yule trees with $\lambda = 1$ and of simulation by Alg. 3 (gray), also scaled and centred to mean 0 and variance 1. Left: density estimates for $\Phi^{(n)}$, right: $\tilde{\Phi}^{(n)}$. The curves are calculated by R's density() function.

4.2 Power of the tests

For a given test statistic to be useful one also needs to know its power, the ability to reject the null hypothesis (here Yule tree) when a given alternative one is true. For example, balance indices based only on topology like Sackin's, Colless' or $\tilde{\Phi}^{(n)}$ cannot be expected to differentiate between any trees that are generated by different constant rate birth-death processes or by the coalescent. The rationale behind this is that the topologies induced by the $n$ contemporary species (i.e. we forget about lineages leading to extinct ones) are stochastically indistinguishable no matter what the death or birth rate is (Thm. 2.3, Cor. 2.4 Gernhard, 2008). Similarly, regarding the coalescent, at the bottom of their p. 93 Steel and McKenzie (2001) write "..., one has the coalescent model [1,18,19]. In this model one starts with n objects, then picks two at random to coalesce, giving n − 1 objects. This process is repeated until there is only a single object left. If this process is reversed, starting with one object to give n objects, then it is equivalent to the Yule model. Note that in the coalescent model there is commonly a probability distribution for the times of coalescences, but in the Yule model we ignore this element." To differentiate between such trees one needs to take into consideration the branch lengths. Here we compare the power of the Sackin's, Colless', $\tilde{\Phi}^{(n)}$ and $\Phi^{(n)}$ indices at the 5% significance level. The null hypothesis is always that the tree is generated by a pure-birth process with rate $\lambda = 1$.
The alternative ones are birth-death processes ($\lambda = 1$, death rate $\mu = 0.25$ and $0.5$, using the TreeSim package), the coalescent process (ape's rcoal() function, Paradis et al., 2004) and the biased speciation model for $p \in \{0.05, 0.1, 0.125, 0.15, 0.18, 0.2, 0.25, 0.4, 0.5\}$. We also simulate a pure-birth process to check if the significance level is met. All trees were simulated with an exp(1) root edge.

The so-called biased speciation model with parameter $p$ is the tree growth model as described by Blum and François (2005). In their words, "Assume that the speciation rate of a specific lineage is equal to r (0 ≤ r ≤ 1). When a species with speciation rate r splits, one of its descendant species is given the rate pr and the other is given the speciation rate (1 - p)r where p is fixed for the entire tree. These rates are effective until the daughter species themselves speciate. Values of p close to 0 or 1 yield very imbalanced trees while values around 0.5 lead to over-balanced phylogenies." We simulated such trees with in-house R code.

The quantiles of Sackin's and Colless' indices were obtained using Alg. 3. It is known (Eqs. 2 and 3, Blum and François, 2005; Blum et al., 2006) that after normalization (centring by expectation and dividing by $n$) in the limit they satisfy a contraction-type distribution of the form of Eq. (14), i.e.

$$Y \stackrel{d}{=} \tau Y' + (1 - \tau) Y'' + C(\tau)$$

for $\tau \sim \mathrm{Unif}[0,1]$. The function $C(\tau)$ takes the form

$$C(\tau) = 2\tau \log \tau + 2(1 - \tau)\log(1 - \tau) + 1$$

in Sackin's case and

$$C(\tau) = \tau \log \tau + (1 - \tau)\log(1 - \tau) + 1 - 2\min(\tau, 1 - \tau)$$

in Colless' case. In particular, studying the limit of Sackin's index is equivalent to studying the Quicksort distribution (Blum and François, 2005). We can immediately see the main qualitative difference: the limit of the normalized cophenetic index has the squares $\tau^2$, $(1 - \tau)^2$ in the "recursion part" while Sackin's and Colless' have $\tau$, $(1 - \tau)$. Using 10000 repeats of Alg. 3 with recursion depth 10 we obtained the following sets of quantiles: $q(0.025) = -0.983$, $q(0.95) = 1.189$, $q(0.975) = 1.493$, and $q(0.025) = -1.354$, $q(0.95) = 1.494$, $q(0.975) = 1.868$, respectively for the normalized Sackin's and Colless' indices.

Under each model we simulated 10000 trees conditioned on 500 contemporary tips. We then checked if the tree was outside the 95% "Yule zone" (Yang et al., 2017) by the procedure described in Alg. 1. We calculated the normalized Sackin's, Colless', discrete and continuous cophenetic indices (normalizations from Thm. 2.12). The functions sackin.phylo() and colless.phylo() of the phyloTop (Kendall et al., 2016) R package were used, while the two cophenetic indices were calculated using a linear-time in-house R implementation based on traversing the tree (Cor. 3, Mir et al., 2013). Two tests were considered, a two-sided one and a right-tailed one. For the discrete cophenetic index the quantiles from the simulation by Alg. 3 were considered, for the continuous those from $W$ (Tab. 1). The power is then estimated as the fraction of times the null hypothesis was rejected and represented in Tab. 3 by the corresponding Type II error rates. For the Yule tree simulation we can see that the significance level is met. All simulated trees are independent of the trees used to obtain the values in Tab. 1 and the quantiles of Sackin's and Colless' indices. Hence, they offer a validation of the rejection regions. We summarize the power study in Tab. 3.

As indicated in Alg. 1, one should first "correct" the tree for the speciation rate when using the cophenetic index with branch lengths. The distributional results derived here on the cophenetic indices are for a unit speciation rate Yule tree. From a mathematical perspective this is not a significant restriction. If one has a pure-birth tree generated by a process with speciation rate $\lambda \neq 1$, then multiplying all branch lengths by $\lambda$ will make the tree equivalent to one with unit rate.
Hence, all the results presented here are general up to a multiplicative constant. However, from an applied perspective the situation cannot be treated so lightly. For example, if we used the cophenetic index with branch lengths from a Yule tree with a very large speciation rate, then we would expect a signi cant deviation. However, unless one is interested in deviations from the unit speciation rate Yule tree, this would not be useful. Hence, one needs to correct for this e ect. If the tree did come from a Yule process, then an estimate, , of the speciation rate 21 22 Model Sackin's Colless' c c ^ ^ > 2 > 2 > 2 > > 2 2 mean() variance() Yule 0:952 0:952 0:955 0:955 0:953 0:952 0:949 0:95 0:944 0:944 1 0:002 Coalescent 0:955 0:954 0:956 0:959 0:952 0:955 0:936 0 0:881 0 37:836 42:06 birth{death  = 0:25 0:952 0:953 0:956 0:955 0:948 0:952 0:853 0:874 0:903 0:91 0:87 0:002 birth{death  = 0:5 0:95 0:95 0:952 0:955 0:951 0:953 0:635 0:729 0:739 0:808 0:722 0:001 biased speciation p = 0:05 0 0 0 0 0 0 0 0:98 0 0:542 0:004 4:213 10 biased speciation p = 0:1 0 0 0 0 0 0 0 0:982 0 0:521 0:004 4:241 10 biased speciation p = 0:125 0 0 0 0 0:016 0:431 0 0:981 0 0:522 0:004 4:211 10 biased speciation p = 0:18 1 1 0:497 0:959 1 1 0 0:981 0 0:524 0:004 4:191 10 biased speciation p = 0:2 1 1 1 1 1 1 0 0:982 0 0:508 0:004 4:222 10 biased speciation p = 0:25 1 0:834 1 1 1 1 0 0:98 0 0:515 0:004 4:243 10 biased speciation p = 0:4 1 0 1 0 1 0 0 0:982 0:001 0:51 0:004 4:218 10 biased speciation p = 0:5 1 0 1 0 1 0 0 0:983 0:002 0:509 0:004 4:335 10 Table 3: Power, presented as Type II error rates, of the various indices to detect deviations from the Yule tree for various alternative models at the 5% signi cance level. In the rst row the trees were simulates under the Yule (i.e. we present the Type I error rate) so this is a con rmation of correct signi cance level. 
Each probability is the fraction of 10000 independently simulated trees that were accepted as Yule trees by the various tests. Columns with the ">" label indicate the right-tailed test and with the label "2" the two-sided test. The critical regions for the cophenetic indices were taken from the pooled estimates in Tab. 1. The superscript c indicates tests where the trees were corrected for the speciation rate through multiplying all branch lengths with $\hat{\lambda}$. The mean and variance over all trees of $\hat{\lambda}$, as obtained through ape's yule() function, are reported. Each tree's branches were scaled by its particular $\hat{\lambda}$ estimate.

by maximum likelihood can be obtained. For example, in the work here we used ape's yule() function. Then, one multiplies all branch lengths in the tree by $\hat{\lambda}$ and calculates the cophenetic index for this transformed tree. It is important to point out that $\hat{\lambda}$ is only an estimate and hence a random variable. The effects of this source of randomness on the limit distribution deserve a separate, detailed study. Balance indices that do not use branch lengths do not suffer from this issue but on the other hand miss another aspect of the tree: proportions between branch lengths that are non-Yule like.

The power analysis presented in Tab. 3 generally agrees with intuition and the power analysis done by Blum and François (2005). The first row shows that for all tests and statistics the 5% significance level is approximately kept. Then, in the next three rows (coalescent and birth-death process) all topology-based indices fail completely (the power is at the significance level). This is completely unsurprising as, after one removes all speciation events (with lineages) leading to extinct species from a birth-death tree, the remaining tree is topologically equivalent to a pure-birth tree. The same is true for the coalescent model, its topology is identical in law to the Yule tree's one. The cophenetic index with branch lengths has a high Type II error rate but is still better than the topological indices. However, when one "corrects for $\lambda$" this index manages to nearly perfectly (2 trees were not rejected by the two-sided test) reject the coalescent model trees.

Power for the biased speciation model follows the same pattern as Blum and François (2005) observed. When imbalance is evident, $p \le 0.125$, all (uncorrected) tests were nearly perfect (the two-sided discrete cophenetic test is an exception). However, the $\hat{\lambda}$ correction significantly worsened the ability of $\Phi^{(n)}$ to detect deviations. As imbalance decreased so did the power of the topological indices. For over-balanced trees one-sided tests failed while two-sided ones worked (just as Blum and François, 2005, observed). The cophenetic index with branch lengths (without correction), which does not consider only the topology, was able to successfully reject the pure-birth tree for all $p$ (with only minimal Type II error for $p \ge 0.4$ in the two-sided test case). Interestingly, the power of $\Phi^{(n)}$ (both corrected and uncorrected) seems invariant with respect to $p$. These results are especially promising as $\Phi^{(n)}$ seems to be an index that functions significantly better in the difficult, $0.18 \le p \le 0.25$, regime, even after correcting.

At this stage we can point out that a normal approximation to the cophenetic indices' limit distribution is not appropriate. When doing the above power study we observed that, when using the quantiles of the standard normal distribution, the right-tailed tests based on the two cophenetic indices reject 6.81% and 7.03% of Yule trees, while the two-sided tests reject 4.87% and 4.66% of Yule trees. The Type I error rates of the two-sided tests are within the observed Monte Carlo errors (in Tab. 3) but the right-tailed tests' Type I error rates are evidently inflated. This confirms that the right tail of the scaled cophenetic index is much heavier than normal.
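The inflation of the right-tailed Type I error under a normal approximation can be reproduced with a toy example. The sketch below is only an analogy: it uses a standardized exponential variable as a stand-in for a right-skewed null distribution (it is not the cophenetic index), and the helper names `rejection_rates` and `standardized_exponential` are hypothetical. The cut-offs 1.644 and 1.96 are the standard normal quantiles quoted in Tab. 1.

```python
import math
import random

NORMAL_Q95 = 1.644   # standard normal 95% quantile, as quoted in Tab. 1
NORMAL_Q975 = 1.96   # standard normal 97.5% quantile, as quoted in Tab. 1


def rejection_rates(sample, right_cut=NORMAL_Q95, two_sided_cut=NORMAL_Q975):
    """Fractions of standardized values rejected by the right-tailed and two-sided tests."""
    n = len(sample)
    right = sum(x > right_cut for x in sample) / n
    two_sided = sum(abs(x) > two_sided_cut for x in sample) / n
    return right, two_sided


def standardized_exponential(size, seed=1):
    """Toy right-skewed null: Exp(1) - 1 has mean 0 and variance 1 (an assumption for illustration)."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0) - 1.0 for _ in range(size)]


sample = standardized_exponential(200_000)
right_rate, two_sided_rate = rejection_rates(sample)

# Exact values for the toy null: P(Exp(1) - 1 > c) = exp(-(1 + c)).
# The lower tail of the two-sided test is empty because the variable is >= -1.
exact_right = math.exp(-(1.0 + NORMAL_Q95))
exact_two_sided = math.exp(-(1.0 + NORMAL_Q975))

print(f"right-tailed: simulated {right_rate:.4f}, exact {exact_right:.4f}")
print(f"two-sided:    simulated {two_sided_rate:.4f}, exact {exact_two_sided:.4f}")
```

In this toy setting the right-tailed test rejects clearly more than 5% of null samples while the two-sided test stays close to 5%, which is the same qualitative pattern as reported above for the Yule trees.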
In short, the power study indicates that the cophenetic index with branch lengths should be considered as an option to detect deviations from the Yule tree. This is because it is able to use information from two sources, the topology and time (a needed direction of development, as indicated in Ch. 33 of Felsenstein, 2004). Actually, this is evident in the decomposition of Eq. (3). The $V_i^{(n)}$s describe the topology and the $Z_i$s the branch lengths. With more information a more powerful testing procedure is possible. Deviations that are not topologically visible, e.g. biased speciation in the $0.18 \le p \le 0.25$ regime, are now detectable. To use $\Phi^{(n)}$ one should correct for the effects of the speciation rate, as otherwise one merely detects deviations from the unit rate Yule tree. This correction is a mixed blessing. It can help or hinder detection.

4.3 Examples with empirical phylogenies

It is naturally interesting to ask how the indices behave for phylogenies estimated from sequence data. Comparing a database of phylogenies, like TreeBase (http://www.treebase.org), with yet another index's distribution under the Yule model should not be expected to yield interesting results. The Yule model has been indicated as inadequate to describe the collection of TreeBase's trees (e.g. Blum and François, 2007). Therefore, we choose a particular study that estimated a tree and also reported a collection of posterior trees. Sosa et al. (2016a) is a recent work, providing all trees from BEAST's (Drummond and Rambaut, 2007) output, well suited for such a purpose. Sosa et al. (2016a) estimate the evolutionary relationships between a set of 109 tree fern species. They report a posterior set of 22498 phylogenies (Sosa et al., 2016b).

In Tab. 4 we look at what percentage of the trees from the posterior was accepted as being consistent with the Yule tree by the various tests and indices. It can be seen that the discrete cophenetic index has a high acceptance rate.
24 The continuous one, which also takes into account branch lengths did not ac- cept a single tree. However, this is lost when one corrects for the speciation rate ( rst ape's multi2di() was used to make the trees binary ones). Most tests and indices rejected the Yule tree for the maximum likelihood estimate of the phylogeny with some exceptions. The two{sided discrete cophenetic index test did not reject the null hypothesis of the pure{birth tree. Also after correcting for the speciation rate (estimated at  = 0:023), neither test based on the continuous cophenetic index rejected the Yule tree. Therefore, one should conclude (based on the \topological balances") that the Yule tree null hypothesis can be rejected for this clade of plants. Sackin's Colless' c c > Sackin's 2 > 2 > 2 > > 2 2 0:03 0:109 0:019 0:062 0:385 0:559 0 0:974 0 0:995 Table 4: Percentage of trees from Sosa et al. (2016b)'s set of posterior trees accepted as Yule trees by the various tests and indices. Columns with \>" label indicate right{tailed test and with label \2" the two{sided test. The critical regions for the cophenetic indices were taken from the pooled esti- mates in Tab. 1. The superscript c indicates tests, where the trees were corrected for the speciation rate through multiplying all branch lengths with . Ape's yule() function returned an average over all trees estimate of of 0:023 with variance 2:988 10 . Each tree's branches were scaled by its particular  estimate. Each tree was rst transformed by ape's multi2di() into a binary one. We also followed Blum and Fran cois (2005) in looking at Yusim et al. (2001)'s phylogeny of the human immunode ciency virus type 1 (HIV{1) group M gene sequences, available in the ape R package. The phylogeny consists of 193 tips and Blum and Fran cois (2005) could not reject the null hypothesis of the pure{birth tree (using Sackin's index amongst others). 
Af- ter pruning the tree to keep \only the old internal branches that corresponded to the 30 oldest ancestors" they were able to reject the Yule tree. They con- clude that the \results probably indicate a change in the evolutionary rate during the evolution which had more impact on cladogenesis during the early expansion of the virus." Repeating their experiment we nd that only the two versions of the cophenetic index point to a deviation but only in the two{ 25 (n) sided test (see Tab. 5). Based only on the  's test and that it con icts with the conclusions of Sackin's and Colless' one should not draw any con- (n) clusions. However, as  's test indicates a deviation, we can be inclined to reject the null hypothesis of the Yule tree. This is further strengthened by the fact that the signi cance remains after the correction for . Even though the topology as a whole seems consistent with the pure{birth tree the branch lengths are not. The fact that only the two{sided test rejected the Yule tree indicates that the HIV phylogeny is over{balanced in comparison to a pure{birth tree. In fact, in the biased speciation model tree over{balance is observed for values of p close to 0:5 (Blum and Fran cois, 2005). Such trees have a declining speciation rate as they grow and hence this supports Blum and Fran cois (2005)'s aforementioned explanation. ~ ^ Sackin's Colless' ; ; ; ; ; 0:823 0:993 1:689 1:765 1:602 9:313 Table 5: Values of the normalized indices for Yusim et al. (2001)'s HIV{1 phylogeny. Above each index is an indication if the index deviates at the 5% signi cance level from the Yule tree, dash insigni cant, asterisk signi cant. The rst symbol concerns the right{tailed test, the second the two{sided test. The superscripted  is calculated from the tree corrected for the speciation rate by multiplying all branch lengths by . 
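The speciation-rate correction used throughout Section 4 can be sketched on the level of interspeciation times. This is a minimal illustration under stated assumptions, not the procedure of ape's yule(): the tree is represented only by the times $T_k$ during which $k$ lineages are alive, the estimator (number of speciation events divided by total branch length) is an assumed simple choice, and all function names are hypothetical. The rate 0.023 is taken from the tree fern example above purely as an illustrative value.

```python
import random


def simulate_interspeciation_times(n_tips, rate, seed=0):
    """Draw T_k ~ Exp(k * rate), the time during which k lineages are alive, k = 1..n_tips."""
    rng = random.Random(seed)
    return [rng.expovariate(k * rate) for k in range(1, n_tips + 1)]


def total_branch_length(times):
    """Sum of all branch lengths (root edge included): k lineages live through T_k."""
    return sum(k * t for k, t in enumerate(times, start=1))


def estimate_rate(times):
    """Assumed estimator: observed speciation events divided by total branch length."""
    n_tips = len(times)
    return (n_tips - 1) / total_branch_length(times)


def rescale(times, factor):
    """Multiply every branch length (here: every interspeciation time) by the same factor."""
    return [factor * t for t in times]


times = simulate_interspeciation_times(n_tips=500, rate=0.023, seed=42)
lam_hat = estimate_rate(times)
corrected = rescale(times, lam_hat)

print(f"estimated rate: {lam_hat:.4f}")
print(f"estimated rate after correction: {estimate_rate(corrected):.4f}")
```

By construction, the rate estimated from the corrected tree equals 1, so the corrected tree can be compared with the unit-rate Yule quantiles. The randomness of the estimate itself is not accounted for here.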
5 Almost sure behaviour of the cophenetic index

We study the asymptotic distributional properties of $\Phi^{(n)}$ for the pure-birth tree model using techniques from our previous papers on branching Brownian and Ornstein-Uhlenbeck processes (Bartoszek, 2014; Bartoszek and Sagitov, 2015a,b; Sagitov and Bartoszek, 2012). We assume that the speciation rate of the tree is $\lambda = 1$. The key property we will use is that in the pure-birth tree case the time between two speciation events, $k$ and $k+1$ (the first speciation event is at the root), is exp($k$) distributed, as the minimum of $k$ exp(1) random variables. We furthermore assume that the tree starts with a single species (the origin) that lives for exp(1) time and then splits (the root of the tree) into two species. We consider a tree conditioned on $n$ contemporary species. This conditioning translates into stopping the tree process just before the $n+1$ speciation event, i.e. the last interspeciation time is exp($n$) distributed. We introduce the notation that $U^{(n)}$ is the height of the tree, $\tau^{(n)}$ is the time to coalescent of two randomly selected tip species and $T_k$ is the time between speciation events $k$ and $k+1$ (see Fig. 3 and Bartoszek and Sagitov, 2015b; Sagitov and Bartoszek, 2012).

Figure 3: A pure-birth tree with the various time components marked on it. The between speciation times on this lineage are $T_1$, $T_2$, $T_3 + T_4$ and $T_5$. If we "randomly sample" the pair of extant species "A" and "B", then the two nodes coalesced at time $\tau^{(n)}$.

Theorem 5.1 The cophenetic index is an increasing sequence of random variables, $\Phi^{(n+1)} > \Phi^{(n)}$, and has the recursive representation

$$\Phi^{(n+1)} = \Phi^{(n)} + nU^{(n)} - \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i}^{n} \tau_{ij}^{(n)}, \qquad (16)$$

where $\xi_i^{(n)}$ is an indicator random variable whether tip $i$ split at the $n$-th speciation event.

Proof From the definition we can see that

$$\Phi^{(n)} = \sum_{1 \le i < j \le n} \left(U^{(n)} - \tau_{ij}^{(n)}\right) = \binom{n}{2}\left(U^{(n)} - \mathrm{E}\left[\tau^{(n)} \mid Y_n\right]\right),$$

where $\tau_{ij}^{(n)}$ is the time to coalescent of tip species $i$ and $j$. We now develop a recursive representation for the cophenetic index. First notice that when a new speciation occurs all coalescent times are extended by $T_{n+1}$, i.e.

$$\sum_{1 \le i < j \le n+1} \tau_{ij}^{(n+1)} = \sum_{1 \le i < j \le n} \left(\tau_{ij}^{(n)} + T_{n+1}\right) + \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i}^{n} \left(\tau_{ij}^{(n)} + T_{n+1}\right) + T_{n+1},$$

where the "lone" $T_{n+1}$ is the time to coalescent of the two descendants of the split tip. The vector $\left(\xi_1^{(n)}, \ldots, \xi_n^{(n)}\right)$ consists of $n - 1$ 0s and exactly one 1 (a categorical distribution with $n$ categories all with equal probability). For each $i$ the marginal probability that $\xi_i^{(n)}$ is 1 is $1/n$. We rewrite

$$\sum_{1 \le i < j \le n+1} \tau_{ij}^{(n+1)} = \binom{n+1}{2} T_{n+1} + \sum_{1 \le i < j \le n} \tau_{ij}^{(n)} + \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i}^{n} \tau_{ij}^{(n)}$$

and then obtain the recursive form

$$\begin{aligned}
\Phi^{(n+1)} &= \binom{n+1}{2}\left(U^{(n)} + T_{n+1}\right) - \binom{n+1}{2} T_{n+1} - \sum_{1 \le i < j \le n} \tau_{ij}^{(n)} - \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i}^{n} \tau_{ij}^{(n)} \\
&= \binom{n+1}{2} U^{(n)} - \sum_{1 \le i < j \le n} \tau_{ij}^{(n)} - \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i}^{n} \tau_{ij}^{(n)} \\
&= \Phi^{(n)} + nU^{(n)} - \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i}^{n} \tau_{ij}^{(n)}.
\end{aligned}$$

Obviously, $\Phi^{(n+1)} > \Phi^{(n)}$.

Proof of Theorem 2.4. Obviously

$$W_{n+1} = \frac{n-1}{n+1} W_n + \frac{2}{n+1} U^{(n)} - \binom{n+1}{2}^{-1} \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i}^{n} \tau_{ij}^{(n)}$$

and

$$\begin{aligned}
\mathrm{E}\left[W_{n+1} \mid Y_n\right] &= \frac{n-1}{n+1} W_n + \binom{n+1}{2}^{-1}\left(nU^{(n)} - \frac{1}{n}\sum_{i=1}^{n}\sum_{j \neq i}^{n} \tau_{ij}^{(n)}\right) \\
&= \frac{n-1}{n+1} W_n + \binom{n+1}{2}^{-1}\left(nU^{(n)} - \frac{2}{n}\binom{n}{2}\binom{n}{2}^{-1}\sum_{i<j} \tau_{ij}^{(n)}\right) \\
&= \frac{n-1}{n+1} W_n + \binom{n+1}{2}^{-1}\left(\frac{2}{n}\binom{n}{2} W_n + \left(n - \frac{2}{n}\binom{n}{2}\right)U^{(n)}\right) \\
&= \left(\frac{n-1}{n+1} + \binom{n+1}{2}^{-1}\frac{2}{n}\binom{n}{2}\right) W_n + \binom{n+1}{2}^{-1} U^{(n)} \\
&= \frac{(n-1)(n+2)}{n(n+1)} W_n + \binom{n+1}{2}^{-1} U^{(n)} \\
&= W_n + \binom{n+1}{2}^{-1}\left(U^{(n)} - W_n\right) = W_n + \binom{n+1}{2}^{-1}\left(U^{(n)} - \binom{n}{2}^{-1}\Phi^{(n)}\right) \\
&= W_n + \binom{n+1}{2}^{-1}\left(U^{(n)} - \left(U^{(n)} - \mathrm{E}\left[\tau^{(n)} \mid Y_n\right]\right)\right) > W_n.
\end{aligned}$$

Hence, $W_n$ is a positive submartingale with respect to $Y_n$. Notice that

$$\mathrm{E}\left[W_n^2\right] = \mathrm{E}\left[\left(U^{(n)} - \mathrm{E}\left[\tau^{(n)} \mid Y_n\right]\right)^2\right] \le \mathrm{E}\left[\left(U^{(n)} - \tau^{(n)}\right)^2\right].$$

Then, using the general formula for the moments of $U^{(n)} - \tau^{(n)}$ (Appendix A, Bartoszek and Sagitov, 2015b), we see that

$$\begin{aligned}
\mathrm{E}\left[\left(U^{(n)} - \tau^{(n)}\right)^2\right] &= 2\,\frac{n+1}{n-1}\sum_{j=1}^{n-1}\frac{H_{j,1}^2 + H_{j,2}}{(j+1)(j+2)} \\
&= 2\,\frac{n+1}{n-1}\left(\frac{n}{n+1} H_{n,2} - \frac{n}{n+1} + \sum_{j=1}^{n-1}\frac{H_{j,1}^2}{(j+1)(j+2)}\right) \nearrow \frac{2}{3}\pi^2.
\end{aligned}$$

Hence, $\mathrm{E}[W_n]$ and $\mathrm{E}[W_n^2]$ are $O(1)$ and by the martingale convergence theorem $W_n$ converges almost surely and in $L^2$ to a random variable with finite first and second moments.
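The recursion of Eq. (16) can be checked numerically on a simulated tree. The sketch below is an illustrative check under the assumptions of this section (unit speciation rate, an exp(1) root edge, and a waiting time Exp($k$) while $k$ lineages are alive); the function `grow_yule` and its bookkeeping are hypothetical and are not the in-house implementation mentioned in Section 4. It stores, for every pair of tips, the time from the origin to their most recent common ancestor, i.e. $U^{(n)} - \tau_{ij}^{(n)}$, so that $\Phi^{(n)}$ is the sum of the stored values.

```python
import random


def grow_yule(n_max, seed=0):
    """Grow a unit-rate Yule tree tip by tip; return (n, Phi via Eq. (16), Phi via direct sum)."""
    rng = random.Random(seed)
    height = rng.expovariate(1.0)  # U^(1): the origin lineage lives Exp(1)
    mrca = {}                      # mrca[(i, j)], i < j: time from origin to the MRCA of tips i and j
    phi = 0.0                      # Phi^(1): a single tip has no pairs
    history = []
    for n in range(1, n_max):
        split = rng.randrange(n)   # a uniformly chosen tip speciates at time U^(n)
        new = n                    # label of the newly created tip
        # Sum over j != split of tau_{split, j}^(n) = U^(n) - (MRCA height of split and j).
        tau_sum = sum(
            height - mrca[tuple(sorted((split, j)))]
            for j in range(n) if j != split
        )
        phi_recursive = phi + n * height - tau_sum  # right-hand side of Eq. (16)
        # The new tip inherits the MRCA heights of the split tip; the two daughters coalesce at U^(n).
        for j in range(n):
            if j != split:
                mrca[(j, new)] = mrca[tuple(sorted((split, j)))]
        mrca[(split, new)] = height
        height += rng.expovariate(n + 1)            # n + 1 lineages now alive: waiting time Exp(n + 1)
        phi_direct = sum(mrca.values())             # sum over pairs of U - tau_ij
        history.append((n + 1, phi_recursive, phi_direct))
        phi = phi_recursive
    return history


history = grow_yule(n_max=40, seed=7)
for n, phi_recursive, phi_direct in history[-3:]:
    pairs = n * (n - 1) // 2
    print(n, round(phi_recursive, 6), round(phi_direct, 6), round(phi_recursive / pairs, 6))
```

The last printed column is $W_n = \binom{n}{2}^{-1}\Phi^{(n)}$. The two evaluations of $\Phi^{(n)}$ agree up to floating-point error and the sequence is increasing, as stated in Theorem 5.1.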
Corollary 5.2 W has nite third moment and is L convergent. Proof We rst recall the W is positive. Using the general formula for the (n) (n) moments of U  again we see (n) (n) 3 (n) (n) 3 E (U E  jY )  E (U  ) n1 n+1 1 = 2 (H + 3H + 3H + H ) j;1 j;1 j;2 j;3 n1 (j+1)(j+2) j=1 n1 j;1 n+1 < 16 n1 (j+1)(j+2) j=1 nH nH n+1 n;1 n;1 = 16 = 16 % 16: n1 n+1 n1 3 3 This implies that E [W ] = O(1) and hence L convergence and niteness of the third moment. Remark 5.3 Notice that we (Appendix A, Bartoszek and Sagitov, 2015b) made a typo in the general formula for the cross moment of (n) (n) m (n) E (U  )  : m+r m+r The (1) should not be there, it will cancel with the (1) from the derivative of the Laplace transform. Proofof Theorem 2.7. We write W as n1 k P P (n) (n) (n) (n) (n) W = U E  jY = E U  jY = E 1 T jY n n n i n i=1 k=1 h i n1 n1 n1 n1 P P P P (n) (n) = E T 1 jY = T E 1 jY i n i n k k i=1 k=i i=1 k=i h i n1 n1 n1 P P P (n) (n) = E 1 jY Z = V Z ; n i i k i i=1 k=i i=1 where Z ; : : : ; Z are i.i.d. exp(1) random variables. 1 n1 Remark 5.4 We notice that we may equivalently rewrite ! ! n1 k n1 k h i h i X X X X (n) (n) W = E 1 jY T = E 1 jY Z : (17) n n i n i k k i=1 i=1 k=1 k=1 30 The above and Eq. (3) are very elegant representations of the cophenetic index with branch lengths. They explicitly describe the way the cophenetic index is constructed from a given tree. Proofof Theorem 2.11. The argumentation is analogous to the proof of Thm. 2.4 by using the recursion n n X X (n) (n) (n+1) (n) ~ ~ ~ =  +   +  ; ij i i i=1 i6=j (n) where  is the number of nodes on the path from the root (or appropriately origin) of the tree to tip i, (see also Bartoszek, 2014, esp. Fig. A.8). An alternative proof for almost sure convergence can be found in Section 7.1. 6 Second order properties In this Section we prove a series of rather technical Lemmata and Theorems (n) (n) concerning the second order properties of 1 , V and W . 
Even though we will not obtain any weak limit, the derived properties do give insight on the delicate behaviour of W and also show that no \simple" limit, e.g. Eq. (4), is possible. To obtain our results we used Mathematica 9:0 for Linux x86 (64{bit) running on Ubuntu 12:04:5 LTS to evaluate the required sums in closed forms. The Mathematica code is available as an appendix to this paper. Lemma 6.1 h i n + 1 1 n + 1 1 (n) Var 1 = 2 1 2 (18) n 1 (k + 1)(k + 2) n 1 (k + 1)(k + 2) Proof h i h i h i (n) (n) (n) Var 1 = E 1 E 1 =   =  (1  ) n;k n;k n;k k k k n;k n+1 1 n+1 1 = 2 1 2 : n1 (k+1)(k+2) n1 (k+1)(k+2) (n) The following lemma is an obvious consequence of the de nition of 1 . Lemma 6.2 For k 6= k 1 2 h i (4)(n + 1) (n) (n) Cov 1 ; 1 =   = : n;k n;k k k 1 2 1 2 2 (n 1) (k + 1)(k + 2)(k + 1)(k + 2) 1 1 2 2 (19) Lemma 6.3 h h ii 2 2 (n) (n(k+1))(n(3k +5k4)(k k8)) n+1 Var E 1 jY = 4 : (20) n 2 2 2 k n(n1) (k+1) (k+2) (k+3)(k+4) Proof Obviously h h ii h i h h ii 2 2 (n) (n) (n) Var E 1 jY = E E 1 jY E E 1 jY : n n n k k k We notice (as Bartoszek and Sagitov, 2015b; Bartoszek, 2016, in Lemmata 11 and 2 respectively) that we may write h i h i (n) (n) (n) E E 1 jY = E 1 1 ; k k;1 k;2 (n) (n) (n) where 1 , 1 are two independent copies of 1 , i.e. we sample a pair k;1 k;2 k of tips twice and ask if both pairs coalesced at the k{th speciation event. There are three possibilities, we (i) drew the same pair, (ii) drew two pairs sharing a single node or (iii) drew two disjoint pairs. Event (i) occurs with 1 1 n n probability , (ii) with probability 2(n2) and (iii) with probability 2 2 n2 n n2 n . As a check notice that 1 + 2(n 2) + = . In case (i) 2 2 2 2 (n) (n) 1 = 1 , hence writing informally k;1 k;2 h i h i (n) (n) (n) E 1 1 j(i) = E 1 =  : n;k k;1 k;2 k To calculate cases (ii) and (iii) we visualize the situation in Fig. 4 and recall the proof of Bartoszek and Sagitov (2015b)'s Lemma 1. 
Using Mathe- matica we obtain 32 Figure 4: The three possible cases when drawing two random pairs of tip species that coalesce at the k{th speciation event. In the picture we \ran- domly draw" pairs (A; B) and (C; D). h i n1 (n) (n) 3 3 1 1 E 1 1 j(ii) = 1 : : : 1 1 : : : n j+2 j+1 j k;1 k;2 ( ) ( ) ( ) ( ) 2 2 2 2 j=k+1 1 1 k+2 k+1 ( ) ( ) 2 2 (n+1) n(k+1) = 4 : (n1)(n2) (1+k)(2+k)(3+k) Similarly for case (iii) h i n1 j +1 P P (n) (n) 6 6 4 E 1 1 j(iii) = 1 : : : 1 n j +2 j +1 k;1 k;2 2 2 ( ) ( ) ( ) 2 2 2 j =k+2 j =k+1 2 1 3 3 1 1 1 : : : 1 1 : : : j j +2 j +1 j 2 1 1 1 ( ) ( ) ( ) ( ) 2 2 2 2 1 1 k+2 k+1 ( ) ( ) 2 2 (n+1) (n(k+1))(n(k+2)) = 16 : (n1)(n2)(n3) (k+1)(k+2)(k+3)(k+4) We now put this together as h h ii h i 1 1 (n) n n (n) (n) Var E 1 jY =  + 2(n 2) E 1 1 j(ii) n n;k k k;1 k;2 2 2 h i (n) (n) n2 n + E 1 1 j(iii) k;1 k;2 n;k 2 2 33 and we obtain (through Mathematica) h h ii 2 2 (n) (n(k+1))(n(3k +5k4)(k k8)) n+1 Var E 1 jY = 4 n 2 2 2 n(n1) (k+1) (k+2) (k+3)(k+4) 3k +5k4 ! 4 : 2 2 (k+1) (k+2) (k+3)(k+4) Lemma 6.4 For k < k 1 2 h h i h ii (n) (n) (8)(n+1) (3n(k 2))(n(k +1)) 2 2 Cov E 1 jY ; E 1 jY = : n n 2 k k 2 1 n(n1) (k +1)(k +2)(k +1)(k +2)(k +3)(k +4) 1 1 2 2 2 2 (21) Proof Obviously h h i h ii h h i h ii (n) (n) (n) (n) Cov E 1 jY ; E 1 jY = E E 1 jY E 1 jY E [1 ] E [1 ] : n n n n k k k k k k 1 2 1 2 1 2 We notice that h h i h ii h i (n) (n) (n) (n) E E 1 jY E 1 jY = E 1 1 ; n n k k k k 1 2 1 2 (n) (n) where 1 , 1 are the indicator variables if two independently sampled k k 1 2 pairs coalesced at speciation events k < k respectively. There are now two 1 2 possibilities represented in Fig. 5 (notice that since k 6= k the counterpart 1 2 of event (i) in Fig. 4 cannot take place). Event (ii) occurs with probability 4=(n + 1) and (iii) with probability (n 3)=(n + 1). Event (iii) can be divided into three \subevents". 
Again we recall the proof of Bartoszek and Sagitov (2015b)'s Lemma 1 and we write informally for (ii) using Mathematica h i (n) (n) 3 3 1 1 E 1 1 j(ii) = 1 : : : 1 1 : : : n k +2 k +1 k k k 2 2 2 1 2 ( ) ( ) ( ) ( ) 2 2 2 2 1 1 k +2 k +1 1 1 ( ) ( ) 2 2 (n+1)(n+2) = 4 : (n1)(n2) (k +1)(k +2)(k +2)(k +3) 1 1 2 2 34 Figure 5: The possible cases when drawing two random pairs of tip species that coalesce at speciation events k < k respectively. In the picture we 1 2 \randomly draw" pairs (A; B) and (C; D). In the same way for the subcases of (iii) h i (n) (n) 6 6 1 E 1 1 j(iii) = 1 : : : 1 n k +2 k +1 k k 1 2 2 2 ( ) ( ) ( ) 2 2 2 3 3 1 1 : : : 1 k k +2 k +1 2 1 1 ( ) ( ) ( ) 2 2 2 k 1 6 6 1 + 1 : : : 1 n k +2 k +1 2 2 ( ) ( ) ( ) 2 2 2 j=k +1 3 3 2 1 1 : : : 1 1 : : : k j+2 j+1 j ( ) ( ) ( ) ( ) 2 2 2 2 1 1 k +2 k +1 1 1 ( ) ( ) 2 2 n1 6 6 4 + 1 : : : 1 n j+2 j+1 ( ) ( ) ( ) 2 2 2 j=k +1 3 3 2 1 1 : : : 1 1 : : : j k +2 k +1 k 2 2 2 ( ) ( ) ( ) ( ) 2 2 2 2 1 1 k +2 k +1 1 1 ( ) ( ) 2 2 (n+2)(n+1) n(k +6)5k 14 2 2 = 4 : (n1)(n2)(n3) (k +1)(k +2)(k +2)(k +3)(k +4) 1 1 2 2 2 We now put this together as h h i h ii h i (n) (n) (n) (n) Cov E 1 jY ; E 1 jY = 2(n 2) E 1 1 j(ii) n n k k k k 1 2 2 1 2 h i (n) (n) n2 n + E 1 1 j(iii) n;k n;k k k 1 2 2 2 1 2 35 and we obtain h h i h ii (n) (n) (8)(n+1) (3n(k 2))(n(k +1)) 2 2 Cov E 1 jY ; E 1 jY = n n 2 k k 1 2 n(n1) (k +1)(k +2)(k +1)(k +2)(k +3)(k +4) 1 1 2 2 2 2 ! (24) : (k +1)(k +2)(k +1)(k +2)(k +3)(k +4) 1 1 2 2 2 2 Theorem 6.5 h i 1 n i (n) E V = 2 (22) n 1 i(i + 1) Proof We immediately have h i h h ii n1 (n) (n) E V = E E 1 jY i k k=i n1 n+1 1 1 = 2 n1 i (k+1)(k+2) k=i 1 ni = 2 n1 i(i+1) ! 
: i(i+1)

Theorem 6.6
\[
\operatorname{Var}\left[V_i^{(n)}\right] = 4\,\frac{n+1}{n(n-1)^2}\,\frac{(n-i)(n-(i+1))(i-1)}{i^2(i+1)^2(i+2)(i+3)}. \tag{23}
\]

Proof We immediately may write using Lemmata 6.3, 6.4 and Mathematica
\[
\operatorname{Var}\left[V_i^{(n)}\right] = \frac{1}{i^2}\left(\sum_{k=i}^{n-1}\operatorname{Var}\left[\mathrm{E}\left[\mathbf{1}_k^{(n)}\mid\mathcal{Y}_n\right]\right] + 2\sum_{i\le k_1<k_2\le n-1}\operatorname{Cov}\left[\mathrm{E}\left[\mathbf{1}_{k_1}^{(n)}\mid\mathcal{Y}_n\right],\mathrm{E}\left[\mathbf{1}_{k_2}^{(n)}\mid\mathcal{Y}_n\right]\right]\right).
\]
Inserting the closed forms of Lemmata 6.3 and 6.4 and simplifying the sums gives
\[
\operatorname{Var}\left[V_i^{(n)}\right] = 4\,\frac{n+1}{n(n-1)^2}\,\frac{(n-i)(n-(i+1))(i-1)}{i^2(i+1)^2(i+2)(i+3)} \to 4\,\frac{i-1}{i^2(i+1)^2(i+2)(i+3)}.
\]

Theorem 6.7 For $1 \le i_1 < i_2 \le n-1$ we have
\[
\operatorname{Cov}\left[V_{i_1}^{(n)},V_{i_2}^{(n)}\right] = 4\,\frac{n+1}{n(n-1)^2}\,\frac{(i_1-1)(n-i_2)(n-(i_2+1))}{i_1(i_1+1)\,i_2(i_2+1)(i_2+2)(i_2+3)}. \tag{24}
\]

Proof Again using Lemmata 6.3, 6.4, Mathematica and the fact that $i_1 < i_2$
\[
\begin{aligned}
\operatorname{Cov}\left[V_{i_1}^{(n)},V_{i_2}^{(n)}\right]
&= \frac{1}{i_1 i_2}\operatorname{Cov}\left[\sum_{k=i_1}^{n-1}\mathrm{E}\left[\mathbf{1}_k^{(n)}\mid\mathcal{Y}_n\right],\sum_{k=i_2}^{n-1}\mathrm{E}\left[\mathbf{1}_k^{(n)}\mid\mathcal{Y}_n\right]\right]\\
&= \frac{1}{i_1 i_2}\left(\operatorname{Var}\left[\sum_{k=i_2}^{n-1}\mathrm{E}\left[\mathbf{1}_k^{(n)}\mid\mathcal{Y}_n\right]\right] + \operatorname{Cov}\left[\sum_{k=i_1}^{i_2-1}\mathrm{E}\left[\mathbf{1}_k^{(n)}\mid\mathcal{Y}_n\right],\sum_{k=i_2}^{n-1}\mathrm{E}\left[\mathbf{1}_k^{(n)}\mid\mathcal{Y}_n\right]\right]\right)\\
&= \frac{1}{i_1 i_2}\left(i_2^2\operatorname{Var}\left[V_{i_2}^{(n)}\right] + \sum_{k_1=i_1}^{i_2-1}\sum_{k_2=i_2}^{n-1}\operatorname{Cov}\left[\mathrm{E}\left[\mathbf{1}_{k_1}^{(n)}\mid\mathcal{Y}_n\right],\mathrm{E}\left[\mathbf{1}_{k_2}^{(n)}\mid\mathcal{Y}_n\right]\right]\right)\\
&= 4\,\frac{n+1}{n(n-1)^2}\,\frac{(i_1-1)(n-i_2)(n-(i_2+1))}{i_1(i_1+1)\,i_2(i_2+1)(i_2+2)(i_2+3)}
\to 4\,\frac{i_1-1}{i_1(i_1+1)\,i_2(i_2+1)(i_2+2)(i_2+3)}.
\end{aligned}
\]

Theorem 6.8
\[
\operatorname{Var}\left[\sum_{i=1}^{n-1}V_i^{(n)}\right] = \frac{1}{54n^2(n-1)^2}\Big(179n^4+588n^3+133n^2-432n-468-108n^2(n+1)(n+3)H_{n-1,2}-144nH_{n-1,1}\Big) \to \frac{179}{54}-\frac{\pi^2}{3} \approx 0.025,
\]
\[
\operatorname{Var}\left[\sum_{i=1}^{n-1}V_i^{(n)}Z_i\right] = \frac{1}{9n^2(n-1)^2}\Big(12n^2(n^2-6n-4)H_{n-1,2}-9n^4+102n^3+51n^2-24nH_{n-1,1}-72n-72\Big) \to \frac{2}{9}\pi^2-1 \approx 1.193,
\]
\[
\operatorname{Var}\left[\sum_{i=1}^{n-1}\mathrm{E}\left[V_i^{(n)}\right]Z_i\right] = \frac{1}{3n^2(n-1)^2}\Big(2(12H_{n-1,2}-18)n^4-24n^3+12n^2(2n+1)H_{n-1,2}+24n^2+24n+12\Big) \to \frac{4}{3}\pi^2-12 \approx 1.159,
\]
\[
\operatorname{Var}\left[\sum_{i=1}^{n-1}\left(V_i^{(n)}-\mathrm{E}\left[V_i^{(n)}\right]\right)Z_i\right] = \frac{1}{9n^2(n-1)^2}\Big(99n^4+174n^3-21n^2-144n-108-12n^2(n+1)(5n+7)H_{n-1,2}-24nH_{n-1,1}\Big) \to 11-\frac{10}{9}\pi^2 \approx 0.034. \tag{25}
\]

Proof We use Mathematica to first calculate
\[
\begin{aligned}
\operatorname{Var}\left[\sum_{i=1}^{n-1}V_i^{(n)}\right]
&= \sum_{i=1}^{n-1}\operatorname{Var}\left[V_i^{(n)}\right] + 2\sum_{1\le i_1<i_2\le n-1}\operatorname{Cov}\left[V_{i_1}^{(n)},V_{i_2}^{(n)}\right]\\
&= \frac{1}{54n^2(n-1)^2}\Big(179n^4-108n^2(n+1)(n+3)H_{n-1,2}+588n^3+133n^2-144nH_{n-1,1}-432n-468\Big)
\to \frac{179}{54}-\frac{\pi^2}{3} \approx 0.025.
\end{aligned}
\]
For the second we again use Mathematica and the fact that the $Z_i$s are i.i.d. exp(1).
\[
\begin{aligned}
\operatorname{Var}\left[\sum_{i=1}^{n-1}\mathrm{E}\left[V_i^{(n)}\right]Z_i\right]
&= \sum_{i=1}^{n-1}\left(\frac{2}{n-1}\,\frac{n-i}{i(i+1)}\right)^2\\
&= \frac{2(12H_{n-1,2}-18)n^4+2\big(6n^2(2n+1)H_{n-1,2}-12n^3+12n^2+12n+6\big)}{3n^2(n-1)^2}
\to \frac{4}{3}\pi^2-12 \approx 1.159.
\end{aligned}
\]
For the third equality we use Mathematica and the fact that for independent families $\{X_i\}$ and $\{Y_i\}$ of random variables we have
\[
\begin{aligned}
\operatorname{Var}[XY] &= \mathrm{E}\left[Y^2\right]\operatorname{Var}[X] + (\mathrm{E}[X])^2\operatorname{Var}[Y],\\
\operatorname{Cov}[X_1Y_1,X_2Y_2] &= \mathrm{E}[Y_1]\mathrm{E}[Y_2]\operatorname{Cov}[X_1,X_2] + \mathrm{E}[X_1]\mathrm{E}[X_2]\operatorname{Cov}[Y_1,Y_2].
\end{aligned}
\]
As the $Z_i$s are i.i.d. exp(1) we use Mathematica to obtain
\[
\begin{aligned}
\operatorname{Var}\left[\sum_{i=1}^{n-1}V_i^{(n)}Z_i\right]
&= \sum_{i=1}^{n-1}\operatorname{Var}\left[V_i^{(n)}Z_i\right] + 2\sum_{1\le i_1<i_2\le n-1}\operatorname{Cov}\left[V_{i_1}^{(n)}Z_{i_1},V_{i_2}^{(n)}Z_{i_2}\right]\\
&= 2\sum_{i=1}^{n-1}\operatorname{Var}\left[V_i^{(n)}\right] + \sum_{i=1}^{n-1}\mathrm{E}\left[V_i^{(n)}\right]^2 + 2\sum_{1\le i_1<i_2\le n-1}\operatorname{Cov}\left[V_{i_1}^{(n)},V_{i_2}^{(n)}\right]\\
&= \frac{1}{9n^2(n-1)^2}\Big(12n^2(n^2-6n-4)H_{n-1,2}-9n^4+102n^3+51n^2-24nH_{n-1,1}-72n-72\Big)
\to \frac{1}{9}\left(2\pi^2-9\right) \approx 1.193.
\end{aligned}
\]
For the fourth equality we use the same properties and pair-wise independence of the $Z_i$s.
\[
\begin{aligned}
\operatorname{Var}\left[\sum_{i=1}^{n-1}\left(V_i^{(n)}-\mathrm{E}\left[V_i^{(n)}\right]\right)Z_i\right]
&= \operatorname{Var}\left[\sum_{i=1}^{n-1}V_i^{(n)}Z_i\right] + \operatorname{Var}\left[\sum_{i=1}^{n-1}\mathrm{E}\left[V_i^{(n)}\right]Z_i\right] - 2\sum_{i_1,i_2=1}^{n-1}\operatorname{Cov}\left[V_{i_1}^{(n)}Z_{i_1},\mathrm{E}\left[V_{i_2}^{(n)}\right]Z_{i_2}\right]\\
&= \operatorname{Var}\left[\sum_{i=1}^{n-1}V_i^{(n)}Z_i\right] + \sum_{i=1}^{n-1}\mathrm{E}\left[V_i^{(n)}\right]^2 - 2\sum_{i=1}^{n-1}\mathrm{E}\left[V_i^{(n)}\right]^2\\
&= \operatorname{Var}\left[\sum_{i=1}^{n-1}V_i^{(n)}Z_i\right] - \sum_{i=1}^{n-1}\mathrm{E}\left[V_i^{(n)}\right]^2\\
&= \frac{1}{9n^2(n-1)^2}\Big(99n^4-12n^2(n+1)(5n+7)H_{n-1,2}+174n^3-21n^2-24nH_{n-1,1}-144n-108\Big)
\to 11-\frac{10}{9}\pi^2 \approx 0.034.
\end{aligned}
\]

It is worth noting that the above Lemmata and Theorems were confirmed by numerical evaluations of the formulae and comparing these to simulations performed to obtain Fig. 1. As a check also notice that, as implied by variance properties,
\[
\operatorname{Var}\left[\sum_{i=1}^{n-1}\mathrm{E}\left[V_i^{(n)}\right]Z_i\right] + \operatorname{Var}\left[\sum_{i=1}^{n-1}\left(V_i^{(n)}-\mathrm{E}\left[V_i^{(n)}\right]\right)Z_i\right]
\to \frac{4}{3}\pi^2-12+11-\frac{10}{9}\pi^2 = \frac{2}{9}\pi^2-1 \leftarrow \operatorname{Var}\left[\sum_{i=1}^{n-1}V_i^{(n)}Z_i\right].
\]

Theorem 6.9
\[
\mathrm{E}\left[\operatorname{Var}\left[\frac{1}{n}\sum_{i=2}^{n-2}\frac{V_i^{(n)}Z_i-\mathrm{E}\left[V_i^{(n)}\right]}{\sqrt{\operatorname{Var}\left[V_i^{(n)}\right]}}\,\Bigg|\,\left\{V_i^{(n)}\right\}\right]\right] \to 0.5. \tag{26}
\]

Proof Using the limit for the variance of $V_i^{(n)}$ (Thm. 6.6) and the independence of the $Z_i$s we have
\[
\mathrm{E}\left[\operatorname{Var}\left[\frac{1}{n}\sum_{i=2}^{n-2}\frac{V_i^{(n)}Z_i-\mathrm{E}\left[V_i^{(n)}\right]}{\sqrt{\operatorname{Var}\left[V_i^{(n)}\right]}}\,\Bigg|\,\left\{V_i^{(n)}\right\}\right]\right]
\approx \sum_{i=2}^{n-2}\frac{i^2(i+1)^2(i+2)(i+3)}{4n^2(i-1)}\,\mathrm{E}\left[\left(V_i^{(n)}\right)^2\right].
\]
Now from Thms. 6.6 and 6.5 we have
\[
\begin{aligned}
\mathrm{E}\left[\left(V_i^{(n)}\right)^2\right]
&= 4\,\frac{n+1}{n(n-1)^2}\,\frac{(n-i)(n-(i+1))(i-1)}{i^2(i+1)^2(i+2)(i+3)} + \left(\frac{2}{n-1}\,\frac{n-i}{i(i+1)}\right)^2\\
&= 4\,\frac{(n-i)^2}{(n-1)^2i^2(i+1)^2}\left(\frac{n+1}{n}\,\frac{(n-(i+1))(i-1)}{(n-i)(i+2)(i+3)}+1\right)
\to 4\,\frac{i+5}{i^2(i+1)(i+2)(i+3)}.
\end{aligned}
\]
Plugging this in (and using Mathematica)
\[
\begin{aligned}
\sum_{i=2}^{n-2}\frac{i^2(i+1)^2(i+2)(i+3)}{4n^2(i-1)}\,\mathrm{E}\left[\left(V_i^{(n)}\right)^2\right]
&\approx \sum_{i=2}^{n-2}\frac{i^2(i+1)^2(i+2)(i+3)\,4(i+5)}{4n^2(i-1)\,i^2(i+1)(i+2)(i+3)}\\
&= n^{-2}\sum_{i=2}^{n-2}\frac{(i+1)(i+5)}{i-1}
= n^{-2}\,\frac{1}{2}\left(n^2+11n+24H_{n-3,1}-42\right) \to 0.5.
\end{aligned}
\]

Remark 6.10 Simulations presented in Fig. 6 and Thm. 6.9 suggest a different possible CLT, namely
\[
\frac{1}{n}\sum_{i=2}^{n-2}\frac{V_i^{(n)}Z_i-\mathrm{E}\left[V_i^{(n)}\right]}{\sqrt{\operatorname{Var}\left[V_i^{(n)}\right]}} \xrightarrow{\text{weakly}} \text{some distribution}(\text{mean}=0,\ \text{variance}=1/2). \tag{27}
\]
We sum over $i = 2,\ldots,n-2$ as $V_1^{(n)} = 1$ and $V_{n-1}^{(n)} = \binom{n}{2}^{-1}(n-1)^{-1}$ for all $n$. It would be tempting to take the distribution to be a normal one. However, we should be wary after Rem. 2.9 and Fig. 1 that for our rather delicate problem even very fine simulations can indicate incorrect weak limits. It remains to study the variance of the conditional variance in Eq. (26). It is not entirely clear if this variance of the conditional variance will converge to 0. Hence, it remains an open problem to investigate the conjecture of Eq. (27).

Figure 6: Density estimates of scaled and centred cophenetic indices for 10000 simulated 500 tip Yule trees with $\lambda = 1$. Left: density estimate of $(\Phi^{(n)}-\mathrm{E}[\Phi^{(n)}])/\sqrt{\operatorname{Var}[\Phi^{(n)}]}$. The black curve is the density fitted to simulated data by R's density() function, the gray is the $N(0,1)$ density. Right: simulation of Eq. (27), the gray curve is the $N(0,1/2)$ density, and the black curve is the density fitted to simulated data by R's density() function. The sample variance of the simulated Eq. (27) values is 0.385, indicating that with $n = 500$ we still have a high variability or alternatively that the variance of the sample variance in Eq. (26) does not converge to 0.

7 Alternative descriptions

7.1 Difference process

Let us consider in detail the families of random variables $V_i^{(n)}$ and $\mathrm{E}[\mathbf{1}_k^{(n)}\mid\mathcal{Y}_n]$. Obviously $V_i^{(n)}$ is $\left(i\binom{n}{2}\right)^{-1}$ times the number of pairs that coalesced after the $(i-1)$-st speciation event for a given Yule tree. Denote
\[
A_i^{(n)} := iV_i^{(n)}.
\]
As going from $n$ to $n+1$ means a new speciation event and coalescent at this new $n$th event, then
\[
\binom{n+1}{2}A_i^{(n+1)} \ge \binom{n}{2}A_i^{(n)} + 1.
\]
We also know by previous calculations that
\[
\mathrm{E}\left[A_i^{(n)}\right] = i\,\mathrm{E}\left[V_i^{(n)}\right] = \frac{2(n-i)}{(n-1)(i+1)} \to \frac{2}{i+1}.
\]
Let $\xi_i^{(n)}$ denote the number of newly introduced coalescent events after the $(i-1)$-st one when we go from $n$ to $n+1$ species, normalized by $\binom{n+1}{2}$. Obviously $\xi_i^{(n)} \ge \binom{n+1}{2}^{-1} > 0$. Then, we may write
\[
A_i^{(n+1)} = \binom{n+1}{2}^{-1}\binom{n}{2}A_i^{(n)} + \xi_i^{(n)}.
\]
Now,
\[
\begin{aligned}
\mathrm{E}\left[\xi_i^{(n)}\right]
&= \mathrm{E}\left[A_i^{(n+1)}\right] - \binom{n+1}{2}^{-1}\binom{n}{2}\mathrm{E}\left[A_i^{(n)}\right]
= \frac{2(n+1-i)}{n(i+1)} - \frac{n(n-1)}{n(n+1)}\,\frac{2(n-i)}{(n-1)(i+1)}\\
&= \frac{2}{i+1}\left(\frac{n+1-i}{n}-\frac{n-i}{n+1}\right)
= \frac{2}{i+1}\,\frac{(n-i+1)(n+1)-n(n-i)}{n(n+1)}
= \frac{2}{i+1}\,\frac{n(n-i)+n+n-i+1-n(n-i)}{n(n+1)}\\
&= \frac{2}{i+1}\,\frac{2n+1-i}{n(n+1)} \to 0.
\end{aligned}
\]
Therefore, for every $i$, $\xi_i^{(n)} \to 0$ almost surely as it is a positive random variable whose expectation goes to 0. However, $A_i^{(n)}$ is bounded by 1, as it can be understood in terms of the conditional (on tree) cumulative distribution function for the random variable $\kappa$, the speciation event at which a random pair of tips coalesced, i.e. for all $i = 1,\ldots,n-1$
\[
P(\kappa \le i-1 \mid \mathcal{Y}_n) = 1 - A_i^{(n)}.
\]
Therefore, as $A_i^{(n)}$ is bounded by 1 and the difference process $A_i^{(n)} - A_i^{(n-1)}$ goes almost surely to 0 we may conclude that $A_i^{(n)}$ converges almost surely to some random variable $A_i$. In particular, this implies the almost sure convergence of $V_i^{(n)}$ to a limiting random variable $V_i$. Furthermore, as $\mathrm{E}\left[\sum_{i=1}^{n-1}V_i^{(n)}\right]$ and $\operatorname{Var}\left[\sum_{i=1}^{n-1}V_i^{(n)}\right]$ are both $O(1)$ we may conclude that $\sum_{i=1}^{n-1}V_i^{(n)}$ also converges almost surely. This means that the discrete version (all $T_i = 1$, corresponding to $\tilde\Phi^{(n)}$) of the cophenetic index converges almost surely (compare with Thm. 2.11).

7.2 Pólya urn description

The cophenetic index both in the discrete and continuous version has the following Pólya urn description. We start with an urn filled with $n$ balls. Each ball has a number painted on it, 0 initially. At each step we remove a pair of balls, say with numbers $x$ and $y$ and return a ball with the number $(x+1)(y+1)$ painted on it. We stop when there is only one ball, it will have value . Denote $B_{k,i,n}$ as the value painted on the $k$-th ball in the $i$-th step when we initially started with $n$ balls. Then we can represent the cophenetic index as
\[
\Phi^{(n)} = \sum_{i=1}^{n-1}\sum_{k=1}^{i}B_{k,i,n}T_i \quad\text{and}\quad \tilde\Phi^{(n)} = \sum_{i=1}^{n-1}\sum_{k=1}^{i}B_{k,i,n}.
\]

Acknowledgments

I was supported by the Knut and Alice Wallenberg Foundation and am now by the Swedish Research Council (Vetenskapsrådet) grant no. 2017-04951. I am grateful to the Barcelona Graduate School of Mathematics (BGSMath) for sponsoring the Workshop on Algebraical and Combinatorial Phylogenetics which significantly contributed to the development of my work. I would like to thank the whole Computational Biology and Bioinformatics Research Group of the Balearic Islands University for hosting me on multiple occasions, many discussions and suggestions on phylogenetic indices. My visits to the Balearic Islands University were partially supported by the G S Magnuson Foundation of the Royal Swedish Academy of Sciences (grants no. MG2015-0055, MG2017-0066) and The Foundation for Scientific Research and Education in Mathematics (SVeFUM). I would like to acknowledge Gabriel Yedid for numerous discussions on the distribution of the cophenetic index and sharing his cophenetic index simulation R code.
I am grateful to Cecilia Holmgren and Svante Janson for pointing me to the works on contraction-type distributions and many discussions. I would furthermore like to acknowledge Wojciech Bartoszek, Sergey Bobkov, Joachim Domsta, Serik Sagitov, Mike Steel for helpful comments and discussions related to this work. I am indebted to two anonymous reviewers, an anonymous editor and Haochi Kiang for careful reading of an earlier version of the manuscript and comments significantly improving it.

References

P.-M. Agapow and A. Purvis. Power of eight tree shape statistics to detect nonrandom diversification: a comparison by simulation of two models of cladogenesis. Syst. Biol., 51(6):866–872, 2002.
K. Bartoszek. Quantifying the effects of anagenetic and cladogenetic evolution. Math. Biosci., 254:42–57, 2014.
K. Bartoszek. A central limit theorem for punctuated equilibrium. ArXiv e-prints, 2016.
K. Bartoszek and S. Sagitov. A consistent estimator of the evolutionary rate. J. Theor. Biol., 371:69–78, 2015a.
K. Bartoszek and S. Sagitov. Phylogenetic confidence intervals for the optimal trait value. J. App. Prob., 52:1115–1132, 2015b.
M. G. B. Blum and O. François. On statistical tests of phylogenetic tree imbalance: The Sackin and other indices revisited. Math. Biosci., 195:141–153, 2005.
M. G. B. Blum and O. François. Which random processes describe the Tree of Life? A large-scale study of phylogenetic tree imbalance. Syst. Biol, 55(4):685–691, 2007.
M. G. B. Blum, O. François, and S. Janson. The mean, variance and limiting distribution of two statistics sensitive to phylogenetic tree balance. Ann. Appl. Probab., 16(4):2195–2214, 2006.
G. Cardona, A. Mir, and F. Rosselló. Exact formulas for the variance of several balance indices under the Yule model. J. Math. Biol., 67:1833–1846, 2013.
D. H. Colless. Review of "Phylogenetics: the theory and practise of phylogenetic systematics". Syst. Zool., 31:100–104, 1982.
L. Devroye, J. A. Fill, and R. Neininger. Perfect simulation from the Quicksort limit distribution. Electronic Comm. Probab., 5(12):95–99, 2000.
A. J. Drummond and A. Rambaut. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol., 7:214, 2007.
W. Ewens and G. Grant. Statistical Methods in Bioinformatics: An Introduction. Springer, New York, 2005.
J. Felsenstein. Inferring Phylogenies. Sinauer Associates Inc., Sunderland, U.S.A., 2004.
J. A. Fill and S. Janson. Smoothness and decay properties of the limiting Quicksort density function. In D. Gardy and A. Mokkadem, editors, Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities, Trends in Mathematics, pages 53–64. Birkhäuser, Basel, 2000.
J. A. Fill and S. Janson. Approximating the limiting Quicksort distribution. Rand. Struct. Alg., 19(3-4):376–406, 2001.
T. Gernhard. The conditioned reconstructed process. J. Theor. Biol., 253:769–778, 2008.
G. Grimmett and D. Stirzaker. Probability and Random Processes (Third Edition). Oxford University Press, Oxford, 2009.
S. Janson. On the tails of the limiting Quicksort distribution. Electronic Comm. Probab., 81:1–7, 2015.
M. L. Kendall, M. Boyd, and C. Colijn. phyloTop, 2016. https://cran.r-project.org/web/packages/phyloTop/index.html.
A. McKenzie and M. Steel. Distributions of cherries for two models of trees. Math. Biosci., 164:81–92, 2000.
A. Mir, F. Rosselló, and L. Rotger. A new balance index for phylogenetic trees. Math. Biosci., 241(1):125–136, 2013.
E. Paradis, J. Claude, and K. Strimmer. APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20:289–290, 2004.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org.
M. Régnier. A limiting distribution for Quicksort. Theor. Inf. Applic., 23(3):335–343, 1989.
A. Rosalsky and M. Sreehari. On the limiting behavior of randomly weighted partial sums. Stat. & Prob. Lett., 40:403–410, 1998.
U. Rösler. A limit theorem for "Quicksort". Theor. Inf. Applic., 25(1):85–100, 1991.
U. Rösler. A fixed point theorem for distributions. Stoch. Proc. Applic., 42:195–214, 1992.
U. Rösler and L. Rüschendorf. The contraction method for recursive algorithms. Algorithmica, 29:3–33, 2001.
M. J. Sackin. "Good" and "bad" phenograms. Syst. Zool., 21:225–226, 1972.
S. Sagitov and K. Bartoszek. Interspecies correlation for neutrally evolving traits. J. Theor. Biol., 309:11–19, 2012.
V. Sosa, J. F. Ornelas, S. Ramírez-Barahona, and E. Gándara. Historical reconstruction of climatic and elevation preferences and the evolution of cloud forest-adapted tree ferns in Mesoamerica. PeerJ, 4:e2696, 2016a.
V. Sosa, J. F. Ornelas, S. Ramírez-Barahona, and E. Gándara. Data from: Historical reconstruction of climatic and elevation preferences and the evolution of cloud forest-adapted tree ferns in Mesoamerica. Dryad Digital Repository, 2016b. https://doi.org/10.5061/dryad.709t8.
T. Stadler. On incomplete sampling under birth-death models and connections to the sampling-based coalescent. J. Theor. Biol., 261(1):58–68, 2009.
T. Stadler. Simulating trees with a fixed number of extant species. Syst. Biol., 60(5):676–684, 2011.
T. Stadler and M. Steel. Distribution of branch lengths and phylogenetic diversity under homogeneous speciation models. J. Theor. Biol., 297:33–40, 2012.
M. Steel and A. McKenzie. Properties of phylogenetic trees generated by Yule-type speciation models. Math. Biosci., 170:91–112, 2001.
G.-D. Yang, P.-M. Agapow, and G. Yedid. The tree balance signature of mass extinction is erased by continued evolution in clades of constrained size with trait-dependent speciation. PLoS ONE, 12(6):e0179553, 2017.
Z. Yang. Computational Molecular Evolution. Oxford Series in Ecology and Evolution. Oxford University Press, Oxford, 2006.
K. Yusim, M. Peeters, O. G. Pybus, T. Bhattacharya, E. Delaporte, C. Mulanga, M. Muldoon, J. Theiler, and B.
Korber. Using human immunodeficiency virus type 1 sequences to infer historical features of the acquired immune deficiency syndrome epidemic and human immunodeficiency virus evolution. Philos. Trans. Roy. Soc. Lond. B, 356:855–866, 2001.

Appendix A: Mathematica code for Section 6

(* Mathematica code used to obtain the closed form formulae of
   Section 3. Second order properties in
   K. Bartoszek, Exact and approximate limit behaviour of the Yule tree's cophenetic index.
   The script was run using Mathematica 9.0 for Linux x86 (64 bit)
   running on Ubuntu 12.04.5 LTS.
   It has to be noted that Mathematica's output should be manually postprocessed
   in order to have the formulae in terms of harmonic sums and not
   derivatives of polygamma functions.
   All the references in this script point to appropriate fragments of the manuscript. *)

(* We choose the pairs in order, i.e. first the first pair to coalesce,
   then the second pair to coalesce.
   (Compare with proof of Lemma 1 of Bartoszek and Sagitov (2015b)) *)
FcoalProb[n_, k_, c_] = FullSimplify[Product[(1 - c/((r*(r - 1))/2)), {r, k + 2, n}]]

(* Def. 2.3, Eq. (1) *)
E1k[n_, k_] := (2 (n + 1)/((n - 1) (k + 1) (k + 2)))

(* Lemma 6.1, Eq. (18) *)
Var1k[n_, k_] := (E1k[n, k] - E1k[n, k]*E1k[n, k])

(* Lemma 6.2, Eq. (19) *)
Cov1k11k2[n_, k1_, k2_] := (-E1k[n, k1]*E1k[n, k2])

(* Lemma 6.3, Eq. (20) *)
VarE1k[n_, k_] = (FullSimplify[
    (1/(n (n - 1)/2))*(2 (n + 1)/((n - 1) (k + 1) (k + 2)))
    + (2 (n - 2)/(n (n - 1)/2))*(Sum[
        FcoalProb[n, j, 3]*(1/((j + 1) j/2))*FcoalProb[j, k, 1]*(1/((k + 1) k/2)),
        {j, k + 1, n - 1}])
    + ((n - 2) (n - 3)/2/(n (n - 1)/2))*(Sum[Sum[
        FcoalProb[n, j1, 6]*(4/((j1 + 1) j1/2))*FcoalProb[j1, j2, 3]*(1/((j2 + 1) j2/2))
          *FcoalProb[j2, k, 1]*(1/((k + 1) k/2)),
        {j1, j2 + 1, n - 1}], {j2, k + 1, n - 2}])
    - (E1k[n, k])*(E1k[n, k])])

(* Lemma 6.4, Eq. (21) *)
CovE1k1E1k2[n_, k1_, k2_] = (FullSimplify[
    (2 (n - 2)/(n (n - 1)/2))*(
        FcoalProb[n, k2, 3]*(1/((k2 + 1) k2/2))*FcoalProb[k2, k1, 1]*(1/((k1 + 1) k1/2)))
    + ((n - 2) (n - 3)/2/(n (n - 1)/2))*(
        FcoalProb[n, k2, 6]*(1/((k2 + 1) k2/2))*FcoalProb[k2, k1, 3]*(1/((k1 + 1) k1/2))
        + Sum[FcoalProb[n, k2, 6]*(1/((k2 + 1) k2/2))*FcoalProb[k2, j, 3]*(2/((j + 1) j/2))
            *FcoalProb[j, k1, 1]*(1/((k1 + 1) k1/2)), {j, k1 + 1, k2 - 1}]
        + Sum[FcoalProb[n, j, 6]*(4/((j + 1) j/2))*FcoalProb[j, k2, 3]*(1/((k2 + 1) k2/2))
            *FcoalProb[k2, k1, 1]*(1/((k1 + 1) k1/2)), {j, k2 + 1, n - 1}])
    - (E1k[n, k1])*(E1k[n, k2])])

(* Thm. 6.1, Eq. (22) *)
EVi[n_, i_] := (FullSimplify[Sum[E1k[n, k], {k, i, n - 1}]/i])

(* Thm. 6.2, Eq. (23) *)
VarVi[n_, i_] = (FullSimplify[
    (Sum[VarE1k[n, k], {k, i, n - 1}]
     + 2 Sum[Sum[CovE1k1E1k2[n, k1, k2], {k2, k1 + 1, n - 1}], {k1, i, n - 1}])/(i*i)])

(* Thm. 6.3, Eq. (24) *)
CovVi1Vi2[n_, i1_, i2_] = (FullSimplify[
    (i2*i2*VarVi[n, i2]
     + Sum[Sum[CovE1k1E1k2[n, k1, k2], {k2, i2, n - 1}], {k1, i1, i2 - 1}])/(i1*i2)])

(* Thm. 6.4, formula 1 *)
EVi2[n_, i_] = (FullSimplify[VarVi[n, i] + (EVi[n, i]^2)])
VarSumVi[n_] = (FullSimplify[
    Sum[EVi2[n, i], {i, 1, n - 1}]
    + 2 Sum[Sum[CovVi1Vi2[n, i1, i2], {i2, i1 + 1, n - 1}], {i1, 1, n - 2}]])

(* Thm. 6.4 Eq. (13), formula 2 *)
VarWn[n_] = (FullSimplify[
    2 Sum[VarVi[n, i], {i, 1, n - 1}]
    + Sum[(EVi[n, i])^2, {i, 1, n - 1}]
    + 2 Sum[Sum[CovVi1Vi2[n, i1, i2], {i2, i1 + 1, n - 1}], {i1, 1, n - 1}]])

(* Thm. 6.4 Eq. (13), formula 3 *)
VarWnBar[n_] = (FullSimplify[Sum[(EVi[n, i])^2, {i, 1, n - 1}]])

(* Thm. 6.4 Eq. (13), formula 4 *)
VarWnCentre[n_] = (FullSimplify[
    2 Sum[VarVi[n, i], {i, 1, n - 1}]
    + 2 Sum[Sum[CovVi1Vi2[n, i1, i2], {i2, i1 + 1, n - 1}], {i1, 1, n - 1}]])

(* Thm. 6.5 *)
FinalPart[n_] = (Sum[(i + 1) (i + 5)/(i - 1), {i, 2, n - 2}])

Appendix B: Counterparts of Rösler (1991)'s Prop.
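3.2 for the cophenetic index

Lemma 7.1 Define for $i \in \{1,\ldots,n\}$
\[
\tilde C_n(i) = n^{-2}\left(\mathrm{E}\left[\tilde\Phi_{NRE}^{(i)}\right] + \mathrm{E}\left[\tilde\Phi_{NRE}^{(n-i)}\right] - \mathrm{E}\left[\tilde\Phi_{NRE}^{(n)}\right] + \binom{i}{2} + \binom{n-i}{2}\right)
\]
and $\tilde C(x) = 0.5 - 3x(1-x)$ for $x \in [0,1]$, then
\[
\sup_{x\in[0,1]}\left|\tilde C_n(\lceil nx\rceil) - \tilde C(x)\right| \le 2n^{-1}\ln n + O(n^{-1}).
\]

Proof Writing out
\[
\begin{aligned}
\tilde C_n(i) &= n^{-2}\Big(i^2+i-2iH_{i,1} + (n-i)^2+(n-i)-2(n-i)H_{n-i,1} - n^2-n+2nH_{n,1} + \binom{i}{2}+\binom{n-i}{2}\Big)\\
&= n^{-2}\Big(3i^2-3in+2^{-1}n^2+2nH_{n,1}-2^{-1}n-2iH_{i,1}-2(n-i)H_{n-i,1}\Big)\\
&< \frac{1}{2} - 3\,\frac{i}{n}\left(1-\frac{i}{n}\right) + 2n^{-1}\ln n.
\end{aligned}
\]
Therefore, assuming that $1 \le \lceil nx\rceil \le n-1$,
\[
\begin{aligned}
\left|\tilde C_n(\lceil nx\rceil) - \tilde C(x)\right|
&\le 3\left|\frac{\lceil nx\rceil}{n}\left(1-\frac{\lceil nx\rceil}{n}\right) - x(1-x)\right| + 2n^{-1}\ln n\\
&\le \sup_{|y-z|<1/n}\left|\tilde C(y)-\tilde C(z)\right| + 2n^{-1}\ln n
\le \frac{6}{n} + 2n^{-1}\ln n + O(n^{-2}).
\end{aligned}
\]
If $\lceil nx\rceil = n$, we notice that $x \in (1-1/n, 1]$ and directly obtain
\[
\left|\tilde C_n(\lceil nx\rceil) - \tilde C(x)\right| \le 3|x(1-x)| + 2n^{-1}\ln n \le 2n^{-1}\ln n + 3n^{-1}.
\]

Lemma 7.2 Define for $i \in \{1,\ldots,n\}$, $T, T' \sim \exp(2)$
\[
C_n(i,T,T') = \frac{1}{n^2}\left(\mathrm{E}\left[\Phi_{NRE}^{(i)}\right] + \mathrm{E}\left[\Phi_{NRE}^{(n-i)}\right] - \mathrm{E}\left[\Phi_{NRE}^{(n)}\right] + \binom{i}{2}T + \binom{n-i}{2}T'\right)
\]
and for $x \in [0,1]$, $T, T' \sim \exp(2)$
\[
C(x,T,T') = \frac{1}{2}x^2T + \frac{1}{2}(1-x)^2T' - x(1-x)
\]
then
\[
\sup_{x\in[0,1]}\left|C_n(\lceil nx\rceil,T,T') - C(x,T,T')\right| \le n^{-1}\ln n + O(n^{-1}) + B_n,
\]
where $B_n$ is a positive random variable that converges to 0 almost surely with expectation decaying as $O(n^{-1})$ and second moment as $O(n^{-2})$.

Proof Similarly, as in the proof of Lemma 7.1 we write out
\[
\begin{aligned}
C_n(i,T,T') &= n^{-2}\Big(\binom{i}{2}T + \binom{n-i}{2}T' + \frac{1}{2}(i^2+i) - iH_{i,1} + \frac{1}{2}\big((n-i)^2+(n-i)\big) - (n-i)H_{n-i,1} - \frac{1}{2}(n^2+n) + nH_{n,1}\Big)\\
&< \frac{1}{2}\left(\frac{i}{n}\right)^2T + \frac{1}{2}\left(\frac{n-i}{n}\right)^2T' - \frac{i}{n}\left(1-\frac{i}{n}\right) + n^{-1}\ln n - \frac{1}{2}\,\frac{1}{n}\left(\frac{i}{n}T + \frac{n-i}{n}T'\right).
\end{aligned}
\]
We denote $A_n = (1/2)\left(\frac{i}{n^2}T + \frac{n-i}{n^2}T'\right)$ and notice that it converges almost surely to 0 with $n$.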
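Now, assuming that $1 \le \lceil nx\rceil \le n-1$
\[
\begin{aligned}
\left|C_n(\lceil nx\rceil) - C(x)\right|
&\le \frac{1}{2}\left|\left(\frac{\lceil nx\rceil}{n}\right)^2 - x^2\right|T + \frac{1}{2}\left|\left(1-\frac{\lceil nx\rceil}{n}\right)^2 - (1-x)^2\right|T'
+ \left|\frac{\lceil nx\rceil}{n}\left(1-\frac{\lceil nx\rceil}{n}\right) - x(1-x)\right| + n^{-1}\ln n + A_n\\
&< \frac{1}{2}\sup_{|y-z|<1/n}\left|y^2-z^2\right|T + \frac{1}{2}\sup_{|y-z|<1/n}\left|y^2-z^2\right|T' + \sup_{|y-z|<1/n}\left|-y(1-y)+z(1-z)\right| + n^{-1}\ln n + A_n\\
&\le \left(n^{-1}+O(n^{-2})\right)T + \left(n^{-1}+O(n^{-2})\right)T' + \frac{2}{n} + O(n^{-2}) + n^{-1}\ln n + A_n.
\end{aligned}
\]
If $\lceil nx\rceil = n$, we notice that $x \in (1-1/n, 1]$ and directly obtain
\[
\left|C_n(\lceil nx\rceil) - C(x)\right| \le n^{-1}T + \frac{1}{2}n^{-2}T' + n^{-1} + n^{-1}\ln n + A_n.
\]
Therefore, if we now denote
\[
B_n = A_n + \left(n^{-1}+O(n^{-2})\right)T + \left(n^{-1}+O(n^{-2})\right)T'
\]
we obtain the statement of the Lemma.

Mathematics arXiv (Cornell University)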
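The approximation quality claimed in Lemma 7.1 can be checked numerically. The following Python sketch is only an illustration under stated assumptions: it takes the no-root-edge expectation $m^2 + m - 2mH_{m,1}$ written out in the proof, reads the additive term as $\binom{i}{2}+\binom{n-i}{2}$, and uses hypothetical function names that do not come from the manuscript's own code.

```python
from math import comb, log


def expected_nre(m, harmonic):
    """Assumed closed form m^2 + m - 2 m H_m (equals 0 for m = 0 and m = 1)."""
    return m * m + m - 2.0 * m * harmonic[m]


def c_n(i, n, harmonic):
    """Finite-n coefficient, with the additive term read as two binomials."""
    return (expected_nre(i, harmonic) + expected_nre(n - i, harmonic)
            - expected_nre(n, harmonic) + comb(i, 2) + comb(n - i, 2)) / n ** 2


def c_limit(x):
    """Limiting function 0.5 - 3x(1-x)."""
    return 0.5 - 3.0 * x * (1.0 - x)


def sup_gap(n):
    """Supremum over x in (0, 1] of |c_n(ceil(n x)) - c_limit(x)|."""
    harmonic = [0.0]
    for k in range(1, n + 1):
        harmonic.append(harmonic[-1] + 1.0 / k)
    worst = 0.0
    for i in range(1, n + 1):
        lo, hi = (i - 1) / n, i / n  # ceil(n x) = i exactly for x in (lo, hi]
        points = [lo, hi]
        if lo < 0.5 < hi:
            points.append(0.5)  # vertex of the parabola, the only interior extremum
        value = c_n(i, n, harmonic)
        worst = max(worst, max(abs(value - c_limit(x)) for x in points))
    return worst


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, sup_gap(n), 2 * log(n) / n)
```

The gap shrinks roughly like a constant over $n$ in this sketch, which is compatible with (and in practice tighter than) the stated $2n^{-1}\ln n + O(n^{-1})$ bound.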

Exact and approximate limit behaviour of the Yule tree's cophenetic index

Mathematics , Volume 2020 (1703) – Mar 27, 2017

ISSN
0025-5564
eISSN
ARCH-3343
DOI
10.1016/j.mbs.2018.05.005

Abstract

In this work we study the limit distribution of an appropriately normalized cophenetic index of the pure-birth tree conditioned on n contemporary tips. We show that this normalized phylogenetic balance index is a submartingale that converges almost surely and in $L^2$. We link our work with studies on trees without branch lengths and show that in this case the limit distribution is a contraction-type distribution, similar to the Quicksort limit distribution. In the continuous branch case we suggest approximations to the limit distribution. We propose heuristic methods of simulating from these distributions and it may be observed that these algorithms result in reasonable tails. Therefore, we propose a way based on the quantiles of the derived distributions for hypothesis testing, whether an observed phylogenetic tree is consistent with the pure-birth process. Simulating a sample by the proposed heuristics is rapid, while exact simulation (simulating the tree and then calculating the index) is a time-consuming procedure. We conduct a power study to investigate how well the cophenetic indices detect deviations from the Yule tree and apply the methodology to empirical phylogenies.

Keywords: Contraction type distribution; Cophenetic index; Martingales; Phylogenetics; Significance testing

arXiv:1703.08954v3 [q-bio.PE] 30 Apr 2018

1 Introduction

Phylogenetic trees are now a standard when analyzing groups of species. They are inferred from molecular sequences by algorithms that often assume a Markov chain for mutations at the individual positions of the genetic sequence (e.g. Ewens and Grant, 2005; Felsenstein, 2004; Yang, 2006). Given a phylogenetic tree it is often of interest to quantify the rate(s) of speciation and extinction for the studied species. To do this one commonly assumes a birth-death process with constant rates.
However, the development of formal statistical tests whether a given tree comes from a given branching process model is an open area of research (see the still relevant "Work remaining" part at the end of Ch. 33 in Felsenstein, 2004). The reason for the apparent lack of widespread use of such tests (but see Blum and François, 2005) could be the lack of a commonly agreed on test statistic. This is as a tree is a complex object and there are multiple ways in which to summarize it in a single number.

One proposed way of summarizing a tree is through indices that quantify how balanced it is, i.e. how close is it to a fully symmetric tree. Two such indices have been with us for many years now: Sackin's (Sackin, 1972) and Colless' (Colless, 1982). Alternatively, McKenzie and Steel (2000) proposed to measure balance by counting cherries on the tree and they showed that after appropriate centring and scaling, this index converges to the standard normal distribution (for examples of other indices see Ch. 33 in Felsenstein, 2004). Recently, a new balance index was proposed: the cophenetic index (Mir et al., 2013).

The work here is inspired by private communication with evolutionary biologist Gabriel Yedid (current affiliation Nanjing Agricultural University, Nanjing, China) concerning the usage of the cophenetic index for significance testing of whether a given tree is consistent with the pure-birth process. He noticed that simulated distributions of the index have much heavier tails than those of the normal and t distributions and hence, comparing centred and scaled cophenetic indices with the usual Gaussian or t quantiles is not appropriate for significance testing. It would lead to a higher false positive rate, i.e. rejecting the null hypothesis of no extinction when a tree was generated by a pure-birth process.

Our aim here is to propose an approach for working analytically with the cophenetic index, especially to improve hypothesis tests for phylogenetic trees, i.e.
how to recognize if the tree is out of the "Yule zone" (Yang et al., 2017). We show that there is a relationship between the cophenetic index and the Quicksort algorithm. This suggests that the methods exploring (e.g. Fill and Janson, 2000, 2001; Janson, 2015) the limiting distribution of the Quicksort algorithm can be an inspiration for studying analytical properties of the cophenetic index.

The paper is organized as follows. In Section 2 we formally define the cophenetic index (for trees with and without branch lengths) and present the most important results of the manuscript. We define an associated submartingale that converges almost surely and in $L^2$ (Thm. 2.4), propose an elegant representation (Thm. 2.7) and a very promising approximation (Def. 2.8). Afterwards in Section 3, we show that in the discrete setting the limit law of the normalized cophenetic index is a contraction-type distribution. Based on this we propose alternative approximations to the limit law of the normalized (with branch lengths) cophenetic index. In Section 4 we describe heuristic algorithms to simulate from these limit laws, show simulated quantiles, explore the power of the cophenetic index to recognize deviations from the Yule tree (comparing with Sackin's and Colless' indices' powers), and apply the indices to example empirical data. In Section 5 we prove the claims presented in Section 2 alongside other supporting results. Then, in Section 6 we study the second order properties of this decomposition and conjecture a Central Limit Theorem (CLT, Rem. 6.10). We end the paper with Section 7 by describing alternative representations of the cophenetic index.

2 The cophenetic index and summary of main results

Mir et al. (2013) recently proposed a new balance index for phylogenetic trees.

Definition 2.1 (Mir et al.
(2013)) For a given phylogenetic tree on $n$ tips and for each pair of tips $(i,j)$ let $\tilde\varphi_{ij}$ be the number of branches from the root to the most recent common ancestor of tips $i$ and $j$. We then define the discrete cophenetic index as
\[
\tilde\Phi^{(n)} = \sum_{1\le i<j\le n}\tilde\varphi_{ij}^{(n)}.
\]
Mir et al. (2013) show that this index has a better resolution than the "traditional" ones. In particular the cophenetic index has a range of values of the order of $O(n^3)$ while Colless' and Sackin's ranges have an order of $O(n^2)$. Furthermore, unlike the other two previously mentioned, $\tilde\Phi^{(n)}$ makes mathematical sense for trees that are not fully resolved (i.e. not binary).

In this work we study phylogenetic trees with branch lengths and hence consider a variation of the cophenetic index.

Definition 2.2 For a given phylogenetic tree on $n$ tips and for each pair of tips $(i,j)$ let $\varphi_{ij}$ be the time from the most recent common ancestor of tips $i$ and $j$ to the root/origin (depending on the tree model) of the tree. We then define the continuous cophenetic index as
\[
\Phi^{(n)} = \sum_{1\le i<j\le n}\varphi_{ij}^{(n)}.
\]

Remark 2.3 In the original setting, when the distance between two nodes was measured by counting branches, Mir et al. (2013) did not consider the edge leading to the root. In our work here, where our prime concern is with trees with random branch lengths, we include the branch leading to the root. This is not a big difference, one just has to remember to add to each distance between nodes the same exponential (exp(1), parametrized by the rate) random variable (see Section 5 for description of the tree's growth).

The results of the present manuscript are built around a scaled version of the cophenetic index which is an almost surely and $L^2$ convergent submartingale. We first introduce some notation. Let $\mathcal{Y}_n$ be the $\sigma$-algebra containing all the information on the Yule tree with $n$ tips and define
\[
H_{n,m} := \sum_{k=1}^{n}1/k^m.
\]
Below we present the main results concerning the cophenetic index, leaving the proofs and supporting theorems for Section 5.
Theorem 2.4 Consider a scaled cophenetic index
\[
W_n = \binom{n}{2}^{-1}\Phi^{(n)}.
\]
$W_n$ is a positive submartingale that converges almost surely and in $L^2$ to a finite first and second moment random variable.

Definition 2.5 For $k = 1,\ldots,n-1$ let us define $\mathbf{1}_k^{(n)}$ as the indicator random variable taking the value of 1 if a randomly sampled pair of species coalesced at the $k$-th (counting from the origin of the tree) speciation event.

We know (e.g. Bartoszek and Sagitov, 2015b; Stadler, 2009; Steel and McKenzie, 2001) that
\[
\pi_{n,k} := P\left(\mathbf{1}_k^{(n)} = 1\right) = \mathrm{E}\left[\mathbf{1}_k^{(n)}\right] = 2\,\frac{n+1}{n-1}\,\frac{1}{(k+1)(k+2)}. \tag{1}
\]

Definition 2.6 For $i = 1,\ldots,n-1$ let us introduce the random variable
\[
V_i^{(n)} := \frac{1}{i}\sum_{k=i}^{n-1}\mathrm{E}\left[\mathbf{1}_k^{(n)}\mid\mathcal{Y}_n\right]. \tag{2}
\]

Theorem 2.7 $W_n$ can be represented as
\[
W_n = \sum_{i=1}^{n-1}V_i^{(n)}Z_i, \tag{3}
\]
where $Z_1,\ldots,Z_{n-1}$ are i.i.d. exp(1) random variables.

Definition 2.8 Define the random variable $\overline{W}_n$ as
\[
\overline{W}_n = \sum_{i=1}^{n-1}\mathrm{E}\left[V_i^{(n)}\right]Z_i, \tag{4}
\]
where $Z_1,\ldots,Z_{n-1}$ are i.i.d. exp(1) random variables.

Remark 2.9 Despite the apparent elegance, it is not straightforward to derive a Central Limit Theorem (CLT) or limit statements concerning $W_n$ from the representation of Eq. (3). Initially one could hope (based on "typical" results on limits for randomly weighted sums, e.g. Thm. 1 of Rosalsky and Sreehari, 1998) that $\overline{W}_n$ could converge a.s. to a random variable that has the same limiting distribution as $W_n$. Similarly, as in the proof of Thm. 2.4 in Section 5, because $(n+2)(n-1)/(n(n+1)) > 1$, we have that $\overline{W}_n$ is an $L^2$ bounded submartingale
\[
\mathrm{E}\left[\overline{W}_{n+1}\mid\overline{W}_n\right] = \frac{(n+2)(n-1)}{n(n+1)}\overline{W}_n + \frac{2}{n(n+1)} > \overline{W}_n.
\]
Hence, $\overline{W}_n$ converges almost surely. Figure 1 can easily mislead one to believe in the equality of the limiting distributions of $W_n$ and $\overline{W}_n$. However, in Thm. 6.8 we can see that $\operatorname{Var}[W_n]$ and $\operatorname{Var}[\overline{W}_n]$ converge to different limits. Therefore, $W_n$ and $\overline{W}_n$ cannot converge in distribution to the same limit.
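As an illustration of how cheap the approximation of Def. 2.8 is to sample, here is a minimal Python sketch. It assumes the closed form $\mathrm{E}[V_i^{(n)}] = 2(n-i)/((n-1)i(i+1))$ for the weights (the expression used for $i\,\mathrm{E}[V_i^{(n)}]$ in Section 7.1); the function names are hypothetical and not taken from the manuscript's own code.

```python
import random
from fractions import Fraction


def expected_v(n, i):
    """Assumed weight E[V_i^(n)] = 2(n-i) / ((n-1) i (i+1))."""
    return Fraction(2 * (n - i), (n - 1) * i * (i + 1))


def wbar_mean(n):
    """Exact mean of the approximation: sum of the weights, since E[Z_i] = 1."""
    return sum(expected_v(n, i) for i in range(1, n))


def wbar_variance(n):
    """Exact variance: sum of squared weights, since Var[Z_i] = 1 and the Z_i are independent."""
    return sum(expected_v(n, i) ** 2 for i in range(1, n))


def sample_wbar(n, size, seed=0):
    """Draw `size` values of sum_i E[V_i^(n)] Z_i with Z_i i.i.d. exp(1)."""
    rng = random.Random(seed)
    weights = [float(expected_v(n, i)) for i in range(1, n)]
    return [sum(w * rng.expovariate(1.0) for w in weights) for _ in range(size)]


if __name__ == "__main__":
    n = 500
    draws = sample_wbar(n, 2000, seed=1)
    print("exact mean    ", float(wbar_mean(n)))
    print("sample mean   ", sum(draws) / len(draws))
    print("exact variance", float(wbar_variance(n)))
```

Each draw needs only $n-1$ exponential variates and no tree, which is why this route is so much faster than simulating Yule trees and computing the index on them.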
However, as we shall see in Section 4, $\overline{W}_n$ provides a reasonable approximation (and importantly extremely cheap, in terms of computational time and memory) to $W_n$ in the sense of their distributions.

Figure 1: The curves are density estimates, via R's (R Core Team, 2013) density() function, of $W_n$'s density (black) and $\overline{W}_n$'s density (gray). They are based on simulated values of $W_n$ from 10000 simulated 500-tip Yule trees with $\lambda = 1$. To obtain a sample from $\overline{W}_n$, independent exp(1) random variables were drawn. The simulated sample of $W_n$ has mean 2, variance 1.214, skewness 1.609 and excess kurtosis 4.237 while the simulated sample of $\overline{W}_n$ has mean 1.973, variance 1.109, skewness 1.634 and excess kurtosis 4.159. It is obvious that $\mathrm{E}[W_n] = \mathrm{E}[\overline{W}_n]$, but we have shown that their variances differ (simulations agree with Thm. 6.8).

Definition 2.10 We naturally define the scaled discrete cophenetic index as
\[
\tilde W_n = \binom{n}{2}^{-1}\tilde\Phi^{(n)}. \tag{5}
\]

Theorem 2.11 $\tilde W_n$ is an almost surely and $L^2$ convergent submartingale.

The applied reader will be most interested in how the results here can be practically used. As written already in the Introduction, balance indices are often used to provide a single-number summary of the tree's shape. Such statistics can then be used e.g. to test if the tree is consistent with some null model (here the Yule model). Naturally, there has been extensive work on using different balance indices for significance testing (e.g. Agapow and Purvis, 2002; Blum and François, 2005; Yang et al., 2017). However previous works nearly always worked with indices that only considered the topology and often obtained the rejection regions through direct simulations.

Unfortunately, looking only at the tree's topology will not allow for distinguishing between some models. In particular (as seen in Tab. 3) there is no difference (from the topological indices perspective) between a Yule tree, a constant rate birth-death tree and a coalescent tree.
Hence, a temporal index that also takes into account the branch lengths should be used (as indicated in the "Work remaining" section at the end of Ch. 33 in Felsenstein, 2004). A statistic based on $\Phi^{(n)}$ performs significantly better (but in these cases still leaves a lot to be desired). However, $\Phi^{(n)}$ shows its true usefulness when employed to distinguish a biased speciation (Blum and François, 2005) from a Yule model. Blum and François (2005) indicated that there is a regime where topological indices fail completely. Table 3 shows that in this setup (and also certain others) the temporal index is superior in recognizing the deviation from the Yule tree.

Directly simulating a tree from a null model (Yule here) and then calculating the index will of course give a sample from the correct null distribution. However, this approach is costly both in terms of time and memory. Therefore, if theoretical results that provide equivalent, asymptotic or approximate representations of the index's law are available they could speed up any study by orders of magnitude. In fact this is clearly visible in Tab. 1: calculating the cophenetic index directly from a sample of simulated pure-birth trees is over 170 times slower than considering $\overline{W}_n$. Even more dramatically one can obtain a sample from an approximation to the equivalent representation of the asymptotic distribution of $\tilde\Phi^{(n)}$ (after normalization) nearly 3000 times faster than directly sampling the discrete cophenetic index.

In Alg. 1 we describe how the approach presented here can be used for significance testing. Then, in Section 4 we discuss in detail the required computational procedures, present simulation results concerning the power of the tests and apply the tests to empirical data. Preceding this computational Section is a characterization of the limit distribution of (normalized) $\tilde\Phi^{(n)}$ and another proposal to approximate the limit of (normalized) $\Phi^{(n)}$.
This section justifies the simulation algorithms described in Section 4.

Algorithm 1 Significance testing
1: input: A phylogenetic tree $T^{(n)}$ with $n$ tips and significance level $\alpha$
2: output: A decision if the null hypothesis of $T^{(n)}$ coming from a pure-birth process is rejected (TRUE) or not (FALSE)
3: Correct, when necessary, the tree for the speciation rate, by multiplying all branch lengths by $\hat{\lambda}$, if the cophenetic index with branch lengths is used. ▷ See Section 4.2.
4: Calculate $\Phi$, $T^{(n)}$'s cophenetic index. ▷ Exactly which version is used, $\Phi^{(n)}$, $\Phi^{(n)}_{NRE}$, $\tilde{\Phi}^{(n)}$, $\tilde{\Phi}^{(n)}_{NRE}$, depends on the particular tree, i.e. if it has branch lengths or a root edge.
5: Standardize $\Phi$ as $X = (\Phi - E_{Yule}[\Phi])/\sqrt{\mathrm{Var}_{Yule}[\Phi]}$. ▷ $E_{Yule}[\Phi]$ and $\mathrm{Var}_{Yule}[\Phi]$ depend on which version of the cophenetic index is considered. In Thm. 2.12 all the possibilities are presented.
6: Obtain the quantiles $q_{Yule}(\alpha/2)$, $q_{Yule}(1-\alpha/2)$ (if the test is two-sided), $q_{Yule}(\alpha)$ (left-tailed), $q_{Yule}(1-\alpha)$ (right-tailed) of $X$ under the Yule model, i.e. $P(X \le q_{Yule}(\alpha)) = \alpha$. ▷ Exactly how to obtain the quantiles is a matter of which version of the cophenetic index is used and of computational resources (see Section 4).
7: if $X$ is inside the rejection region then return TRUE
8: else
9:   return FALSE
10: end if

Theorem 2.12 A random variable with subscript NRE (no root-edge) indicates that this random variable comes from a tree lacking the edge leading to the root.
$$
\begin{aligned}
E\left[\Phi^{(n)}\right] &= \binom{n}{2}\frac{2(n-H_{n,1})}{n-1} \sim n^2,\\
E\left[\Phi^{(n)}_{NRE}\right] &= \binom{n}{2}\left(\frac{2(n-H_{n,1})}{n-1}-1\right) \sim \tfrac{1}{2}n^2,\\
\mathrm{Var}\left[\Phi^{(n)}\right] &= \binom{n}{2}^{2}\frac{1}{9n^2(n-1)^2}\Big(12n^2(n^2-6n-4)H_{n-1,2} - 9n^4 + 102n^3\\
&\qquad + 51n^2 - 24nH_{n-1,1} - 72n - 72\Big) \sim \frac{2\pi^2-9}{36}\,n^4,\\
\mathrm{Var}\left[\Phi^{(n)}_{NRE}\right] &= \binom{n}{2}^{2}\bigg(\frac{1}{9n^2(n-1)^2}\Big(12n^2(n^2-6n-4)H_{n-1,2} - 9n^4 + 102n^3\\
&\qquad + 51n^2 - 24nH_{n-1,1} - 72n - 72\Big) - 1\bigg) \sim \frac{2\pi^2-18}{36}\,n^4,
\end{aligned}
\tag{6}
$$

$$
\begin{aligned}
E\left[\tilde{\Phi}^{(n)}\right] &= \binom{n}{2}\left(\frac{4(n-H_{n,1})}{n-1}-1\right) \sim 3n^2/2,\\
E\left[\tilde{\Phi}^{(n)}_{NRE}\right] &= \binom{n}{2}\left(\frac{4(n-H_{n,1})}{n-1}-2\right) \sim n^2,\\
\mathrm{Var}\left[\tilde{\Phi}^{(n)}\right] &= \tfrac{1}{12}\left(n^4-10n^3+131n^2-2n\right) - 4n^2H_{n,2} - 6nH_{n,1} \sim n^4/12,\\
\mathrm{Var}\left[\tilde{\Phi}^{(n)}_{NRE}\right] &= \tfrac{1}{12}\left(n^4-10n^3+131n^2-2n\right) - 4n^2H_{n,2} - 6nH_{n,1} \sim n^4/12.
\end{aligned}
\tag{7}
$$

Proof The proof of the expectation part is due to Mir et al. (2013); Sagitov and Bartoszek (2012). The variance of $\tilde{\Phi}^{(n)}$ is due to Cardona et al. (2013); Mir et al. (2013). The variance of $\Phi^{(n)}$ is a consequence of the lemmata and theorems presented in Section 6. When the root edge is not included, then we have to decrease the expectation by $\binom{n}{2}$. This is due to each pair of tips "having" the root edge included in the cophenetic distance between them. In the case of branch lengths, the expectation of the root edge, distributed as exp(1), is one. Without a root edge, for the same reason, the variance of $\Phi^{(n)}$ has to be decreased by $\binom{n}{2}^2$. In the discrete case the root edge has a deterministic length of 1 and hence has no effect on the variance.

3 Contraction-type limit distribution

Even though the representation of Eq. (3) is a very elegant one, it is not obvious how to derive asymptotic properties of the process from it (compare Section 6). We turn to considering the recursive representation proposed by Mir et al. (2013)

$$\tilde{\Phi}^{(n)}_{NRE} = \tilde{\Phi}^{(L_n)}_{NRE} + \tilde{\Phi}^{(R_n)}_{NRE} + \binom{L_n}{2} + \binom{R_n}{2}, \tag{8}$$

where $L_n$ and $R_n$ are the numbers of left and right daughter tip descendants. Obviously $L_n + R_n = n$.

From Eq. (8) we will be able to deduce the form of the limit of the process.
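The recursion of Eq. (8) can be turned into a direct simulation of the discrete index without building a tree. The following is only a minimal sketch under an assumption that is not stated in this section, namely that under the Yule model the left clade size $L_n$ is uniformly distributed on $\{1, \ldots, n-1\}$. The function names `simulate_discrete_nre` and `expected_discrete_nre` are hypothetical; the second one evaluates the no-root-edge expectation of Eq. (7) so that the two can be compared.

```python
import random
from math import comb


def simulate_discrete_nre(n, rng):
    """One draw of the discrete cophenetic index (no root edge) for n tips."""
    total = 0
    pending = [n]                      # clade sizes that still have to be split
    while pending:
        size = pending.pop()
        if size < 2:                   # a single tip contributes nothing
            continue
        # Assumption: the left clade size is uniform on {1, ..., size - 1}.
        left = rng.randint(1, size - 1)
        right = size - left
        # Eq. (8): the two daughter edges add binom(L, 2) + binom(R, 2).
        total += comb(left, 2) + comb(right, 2)
        pending.extend((left, right))
    return total


def expected_discrete_nre(n):
    """Expectation from Eq. (7): binom(n, 2) * (4 (n - H_{n,1}) / (n - 1) - 2)."""
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    return comb(n, 2) * (4.0 * (n - harmonic) / (n - 1) - 2.0)


if __name__ == "__main__":
    rng = random.Random(1)
    draws = [simulate_discrete_nre(50, rng) for _ in range(20000)]
    print("simulated mean:", sum(draws) / len(draws))
    print("Eq. (7) mean:  ", expected_discrete_nre(50))
```

The explicit stack avoids deep recursion for very unbalanced splits; adding $\binom{n}{2}$ to a draw gives the version with a root edge of length 1.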
In the case with branch lengths we attempt to approximate the cophenetic index with the following contraction-type law

$$\Phi^{(n)}_{NRE} = \Phi^{(L_n)}_{NRE} + \Phi^{(R_n)}_{NRE} + \binom{L_n}{2}T_{0.5} + \binom{R_n}{2}T'_{0.5}, \tag{9}$$

where $T_{0.5}$, $T'_{0.5}$ are independent exp(2) random variables (we index with the mean to avoid confusion with $T_2$, Section 5, the time between the second and third speciation event, which is also exp(2) distributed). These are the branch lengths leading from the speciation point. The rationale behind the choice of distribution is that a randomly chosen internal branch of a conditioned Yule tree with rate 1 is exp(2) distributed (Cor. 3.2 and Thm. 3.3, Stadler and Steel, 2012). This is of course an approximation, as we cannot expect that the laws of the branch lengths with the depth of the recursion should become indistinguishable from the law of the average branch. In fact, we should expect that the law in Eq. (9) has to depend on $n$, i.e. the level of the recursion. For larger $n$ the branches have distributions concentrated on smaller values, e.g. compare the randomly sampled root adjacent branch length law (Thm. 5.1, Stadler and Steel, 2012) with the law of the average branch length. However, as we shall see, simulations indicate that approximating with the average law could still yield acceptable heuristics, but not as good as the approximation by $\overline{W}_n$. We use the notation $\Phi^{(n)}_{NRE}$, $\tilde{\Phi}^{(n)}_{NRE}$ to differentiate from $\Phi^{(n)}$, $\tilde{\Phi}^{(n)}$ where the root branch is included, i.e.

$$\tilde{\Phi}^{(n)} = \tilde{\Phi}^{(n)}_{NRE} + \binom{n}{2} \quad\text{and}\quad \Phi^{(n)} = \Phi^{(n)}_{NRE} + \binom{n}{2}T_1, \quad\text{where } T_1 \sim \exp(1).$$

Define now

$$Y^{(n)} = n^{-2}\left(\Phi^{(n)}_{NRE} - E\left[\Phi^{(n)}_{NRE}\right]\right), \qquad \tilde{Y}^{(n)} = n^{-2}\left(\tilde{\Phi}^{(n)}_{NRE} - E\left[\tilde{\Phi}^{(n)}_{NRE}\right]\right)$$

and using Eqs.
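(6) and (7) we obtain the following recursions

$$
\begin{aligned}
Y^{(n)} ={}& \left(\frac{L_n}{n}\right)^2 Y^{(L_n)} + \left(\frac{R_n}{n}\right)^2 Y^{(R_n)} + n^{-2}\binom{L_n}{2}T_{0.5} + n^{-2}\binom{R_n}{2}T'_{0.5}\\
&+ n^{-2}\left(E\left[\Phi^{(L_n)}_{NRE}\,\middle|\,L_n\right] + E\left[\Phi^{(R_n)}_{NRE}\,\middle|\,R_n\right] - E\left[\Phi^{(n)}_{NRE}\right]\right)
\end{aligned}
$$

and

$$
\begin{aligned}
\tilde{Y}^{(n)} ={}& \left(\frac{L_n}{n}\right)^2 \tilde{Y}^{(L_n)} + \left(\frac{R_n}{n}\right)^2 \tilde{Y}^{(R_n)} + n^{-2}\binom{L_n}{2} + n^{-2}\binom{R_n}{2}\\
&+ n^{-2}\left(E\left[\tilde{\Phi}^{(L_n)}_{NRE}\,\middle|\,L_n\right] + E\left[\tilde{\Phi}^{(R_n)}_{NRE}\,\middle|\,R_n\right] - E\left[\tilde{\Phi}^{(n)}_{NRE}\right]\right).
\end{aligned}
$$

The process $\tilde{Y}^{(n)}$ is related to the process $\tilde{W}_n$ as

$$\tilde{W}_n = 2(1+n^{-1})\tilde{Y}^{(n)} + \binom{n}{2}^{-1}E\left[\tilde{\Phi}^{(n)}_{NRE}\right].$$

In the continuous case we do not have an exact equality; we rather hope for

$$W_n \approx 2(1+n^{-1})Y^{(n)} + \binom{n}{2}^{-1}E\left[\Phi^{(n)}_{NRE}\right] + T_1$$

in some sense of approximation. Hence, knowledge of the asymptotic behaviour of $Y^{(\infty)}$, $\tilde{Y}^{(\infty)}$ will immediately give us information about $W^{(\infty)}$, $\tilde{W}^{(\infty)}$ in the obvious way

$$\tilde{W}^{(\infty)} = 2\tilde{Y}^{(\infty)} + 2, \qquad W^{(\infty)} \approx 2Y^{(\infty)} + 1 + T_1.$$

The processes $Y^{(n)}$, $\tilde{Y}^{(n)}$ look very similar to the scaled recursive representation of the Quicksort algorithm (e.g. Rösler, 1991). In fact, it is of interest that, just as in the present work, a martingale proof first showed convergence of Quicksort (Régnier, 1989), but then a recursive approach is required to show properties of the limit. The random variable $L_n/n \to \tau \sim \mathrm{Unif}[0,1]$ weakly and, as weak convergence is preserved under continuous transformations (Thm. 18, p. 316, Grimmett and Stirzaker, 2009), we will have $(L_n/n)^2 \to \tau^2$ weakly. Therefore, we would expect the almost sure limits to satisfy the following equalities in distribution (remembering the asymptotic behaviour of the expectations)

$$Y^{(\infty)} \stackrel{D}{=} \tau^2 Y'^{(\infty)} + (1-\tau)^2 Y''^{(\infty)} + \tfrac{1}{2}\tau^2 T_{0.5} + \tfrac{1}{2}(1-\tau)^2 T'_{0.5} - \tau(1-\tau), \tag{10}$$

and

$$\tilde{Y}^{(\infty)} \stackrel{D}{=} \tau^2 \tilde{Y}'^{(\infty)} + (1-\tau)^2 \tilde{Y}''^{(\infty)} + \tfrac{1}{2} - 3\tau(1-\tau), \tag{11}$$

where $\tau$ is uniformly distributed on $[0,1]$, $Y^{(\infty)}$, $Y'^{(\infty)}$ and $Y''^{(\infty)}$ are identically distributed random variables, so are $\tilde{Y}^{(\infty)}$, $\tilde{Y}'^{(\infty)}$ and $\tilde{Y}''^{(\infty)}$, and $Y'^{(\infty)}$, $Y''^{(\infty)}$, $\tilde{Y}'^{(\infty)}$ and $\tilde{Y}''^{(\infty)}$ are independent. Following Rösler (1991)'s approach it turns out that the limiting distributions do satisfy the equalities of Eqs.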
(10) and (11). Let $D$ be the space of distributions with zero first moment and finite second moment. We consider on $D$ the Wasserstein metric

$$d(F,G) = \inf_{X\sim F,\,Y\sim G}\|X-Y\|_2.$$

Theorem 3.1 Let $F \in D$ and assume that $Y, Y' \sim F$, $\tau \sim \mathrm{Unif}[0,1]$, $T_{0.5}, T'_{0.5} \sim \exp(2)$ and $Y, Y', \tau, T_{0.5}, T'_{0.5}$ are all independent. Define transformations $S_1: D \to D$, $S_2: D \to D$ as

$$S_1(F) = \tau^2 Y + (1-\tau)^2 Y' + \tfrac{1}{2}\tau^2 T_{0.5} + \tfrac{1}{2}(1-\tau)^2 T'_{0.5} - \tau(1-\tau), \tag{12}$$

and

$$S_2(F) = \tau^2 Y + (1-\tau)^2 Y' + \tfrac{1}{2} - 3\tau(1-\tau) \tag{13}$$

respectively. Both transformations $S_1$ and $S_2$ are contractions on $(D,d)$ and converge exponentially fast in the $d$-metric to the unique fixed points of $S_1$ and $S_2$ respectively.

Remark 3.2 The proof of Thm. 3.1 is the same as Rösler (1991)'s proof of his Thm. 2.1. However, compared to the Quicksort algorithm (Rösler, 1991) we will have a $\sqrt{2/5}$ upper bound on the rate of decay instead of $\sqrt{2/3}$. This speed-up should be expected as we have $\tau^2$ and $(1-\tau)^2$ instead of $\tau$ and $(1-\tau)$. Thm. 3.1 can also be seen as a consequence of Rösler (1992)'s more general Thms. 3 and 4. The rate of convergence is also a consequence of the general contraction lemma (Lemma 1, Rösler and Rüschendorf, 2001).

Now, using Lemmata 7.1, 7.2 (their proofs in 7.2 differ only in detail from the proof of Prop. 3.2 in Rösler, 1991) and arguing in the same way as Rösler (1991) did in his Section 3, especially his proof of his Thm. 3.1, we obtain that $Y^{(n)}$ and $\tilde{Y}^{(n)}$ converge in the Wasserstein $d$-metric to $Y^{(\infty)}$ and $\tilde{Y}^{(\infty)}$, whose laws are fixed points of $S_1$ and $S_2$ respectively. A minor point should be made. Here, we will have $(i/n)^4$ instead of $(i/n)^2$ in a counterpart of Rösler (1991)'s Prop. 3.3.

Remark 3.3 One may directly obtain from the recursive representation that $E[Y^{(\infty)}] = E[\tilde{Y}^{(\infty)}] = 0$, $\mathrm{Var}[Y^{(\infty)}] = 1/16 = 0.0625$ and $\mathrm{Var}[\tilde{Y}^{(\infty)}] = 1/12$. We can therefore see that in the discrete case the variance agrees. However, in the continuous case we can see that it slightly differs:

$$\mathrm{Var}\left[(W_n - T_1)/2\right] \to \pi^2/18 - 0.5 \approx 0.048.$$

Remark 3.4 One can of course calculate what the mean and variance of $T_{0.5}$, $T'_{0.5}$ should be so that $E[Y^{(\infty)}] = 0$ and $\mathrm{Var}[Y^{(\infty)}] = \lim_{n\to\infty}\mathrm{Var}[(W_n - T_1)/2]$. We should have $E[T_{0.5}] = E[T'_{0.5}] = 0.5$ and $\mathrm{Var}[T_{0.5}] = \mathrm{Var}[T'_{0.5}] = \pi^2/3 - 25/8$. This, in particular, means that these branch lengths cannot be exponentially distributed. We therefore also experimented with drawing $T_{0.5}$, $T'_{0.5}$ from a gamma distribution with rate equalling $1/(2(\pi^2/3 - 25/8))$ and shape equalling $\pi^2/6 - 25/16$. However, this significantly increased the duration of the computations but did not result in any visible improvements in comparison to Tab. 1.

4 Significance testing

4.1 Obtaining the quantiles

Algorithm 1 requires knowledge of the quantiles of the underlying distribution in order to define the rejection region. Unfortunately, an analytical form of the density of any scaled cophenetic index is not known, so one will have to resort to some sort of simulations to obtain the critical values. Directly simulating a large number of pure-birth trees can take an overly long time, measured in minutes (on a modern machine with a large amount of memory, or hours on an older one). Fortunately, the cophenetic index can be calculated in $O(n)$ time (Cor. 3, Mir et al., 2013) and such a tree-traversing algorithm was employed to obtain $\Phi^{(n)}$ and $\tilde{\Phi}^{(n)}$. On the other hand, the suggestive (but wrong) approximations of Eq. (4) and contraction limiting distributions Eqs. (10) and (11) are significantly faster to simulate, see Tab. 1.

Simulating from the approximate Eq. (4) is straightforward. One simply draws $n-1$ independent exp(1) random variables. Simulating random variables satisfying Eqs. (10) and (11) is more involved and it may be possible to develop an exact rejection algorithm (cf. Devroye et al., 2000). Here, we choose simple, approximate but still effective, heuristics in order to demonstrate the usefulness of the approach for significance testing.
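The linear-time calculation of the cophenetic index mentioned above can be illustrated with a single post-order traversal. The sketch below is not necessarily the algorithm of Mir et al. (2013) nor the in-house implementation used for the simulations; it relies only on the observation that the edge leading to a node with $k$ tip descendants is shared by $\binom{k}{2}$ pairs of tips. The nested-tuple tree representation and the name `cophenetic_index` are assumptions made for this example.

```python
from math import comb


def cophenetic_index(tree):
    """Cophenetic index of a rooted tree by one post-order traversal.

    A node is a pair (edge_length, children): edge_length is the length of
    the edge leading to the node (the root edge for the root, 0 if there is
    none) and children is a list of nodes (empty for a tip).
    """
    total = 0.0
    tips_below = {}                       # id(node) -> number of tip descendants
    stack = [(tree, False)]
    while stack:
        node, children_done = stack.pop()
        length, children = node
        if not children:                  # a tip
            tips_below[id(node)] = 1
        elif not children_done:           # first visit: handle children first
            stack.append((node, True))
            stack.extend((child, False) for child in children)
        else:                             # second visit: all children counted
            k = sum(tips_below[id(child)] for child in children)
            tips_below[id(node)] = k
            # the edge above this node lies on the path to the most recent
            # common ancestor of every pair of tips below it
            total += comb(k, 2) * length
    return total


if __name__ == "__main__":
    # ((A, B), C) with a root edge of length 1 and an internal edge of length 2
    cherry = (2.0, [(1.0, []), (1.0, [])])
    print(cophenetic_index((1.0, [cherry, (3.0, [])])))
```

Setting every edge length to 1 gives the discrete index $\tilde{\Phi}^{(n)}$, and setting the root's edge length to 0 gives the corresponding NRE version.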
We now describe algorithms (Algs. 2 and 3) for simulating from a more general distribution, $F$, that satisfies

$$Y \stackrel{D}{=} g_1(\tau)Y' + g_2(\tau)Y'' + C(\tau, \theta), \tag{14}$$

where $Y, Y', Y'' \sim F$, $Y', Y'', \tau, \theta$ are independent, $\tau \sim F_\tau$, $\theta \sim F_\theta$ is some random vector, $g_1, g_2: \mathbb{R} \to \mathbb{R}$ and $C: \mathbb{R}^p \to \mathbb{R}$ for some appropriate $p$ that depends on $\theta$'s dimension. Of course, in our case here we have $\tau \sim \mathrm{Unif}[0,1]$, $g_1(\tau) = \tau^2$, $g_2(\tau) = (1-\tau)^2$,

$$C(\tau, T, T') = \tau^2 T/2 + (1-\tau)^2 T'/2 - \tau(1-\tau) \quad\text{and}\quad \tilde{C}(\tau) = 1/2 - 3\tau(1-\tau)$$

for $\Phi^{(n)}$, $\tilde{\Phi}^{(n)}$ respectively. Of course, $T$, $T'$ are independent and exp(2) distributed. If one considers also the root edge, then to the simulated random variable one needs to add $T_1 \sim \exp(1)$ when simulating $n^{-2}\Phi^{(n)}$ or appropriately 1 if one considers $n^{-2}\tilde{\Phi}^{(n)}$.

The recursion of Alg. 3 for a given realization of the $\tau$ and $\theta$ random variables can be directly solved. However, from numerical experiments, implementing Alg. 3 iteratively seemed computationally ineffective.

In Tab. 1 we report on the simulations from the different distributions. For each distribution we draw a sample of size 10000 and repeat this 100

Algorithm 2 Population approximation
1: Initiate population size $N$
2: Set $P[0, 1:N] = Y_0$ ▷ Initial population
3: for $i = 1$ to $i_{max}$ do
4:   $f_{i-1}$ = density($P[i-1, ]$) ▷ density estimation by R
5:   for $j = 1$ to $N$ do
6:     Draw $\tau$ from $F_\tau$
7:     Draw $\theta$ from $F_\theta$
8:     Draw $Y_1$, $Y_2$ independently from $f_{i-1}$
9:     $P[i,j] = g_1(\tau)Y_1 + g_2(\tau)Y_2 + C(\tau, \theta)$
10:   end for
11: end for
12: return $P[i_{max}, ]$
13: ▷ Add root branch (exp(1) or 1) if needed for each individual.
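A minimal sketch of the population approximation of Alg. 2, specialised to the discrete case of Eq. (11), is given below. It is an illustration and not the R code used for Tab. 1. In place of R's density() it draws from a Gaussian kernel density estimate by resampling an individual and adding Gaussian noise with standard deviation `bandwidth`; the bandwidth value is an assumption of this example, and the default of 0 reduces to plain resampling of the previous generation. The names `population_approximation`, `c_discrete` and `bandwidth` are hypothetical.

```python
import random


def g1(tau):
    return tau ** 2


def g2(tau):
    return (1.0 - tau) ** 2


def c_discrete(tau):
    """Toll term of Eq. (11): 1/2 - 3 tau (1 - tau)."""
    return 0.5 - 3.0 * tau * (1.0 - tau)


def population_approximation(pop_size, generations, seed=None, bandwidth=0.0):
    """Approximate draws from the fixed point of Eq. (11), following Alg. 2."""
    rng = random.Random(seed)
    population = [0.0] * pop_size                 # initial population, Y_0 = 0
    for _ in range(generations):
        new_population = []
        for _ in range(pop_size):
            tau = rng.random()                    # tau ~ Unif[0, 1]
            # Drawing from a Gaussian kernel density estimate: pick an
            # individual and perturb it (no perturbation if bandwidth == 0).
            y1 = rng.choice(population) + bandwidth * rng.gauss(0.0, 1.0)
            y2 = rng.choice(population) + bandwidth * rng.gauss(0.0, 1.0)
            new_population.append(g1(tau) * y1 + g2(tau) * y2 + c_discrete(tau))
        population = new_population
    return population


if __name__ == "__main__":
    sample = population_approximation(10000, 10, seed=1)
    mean = sum(sample) / len(sample)
    var = sum((y - mean) ** 2 for y in sample) / (len(sample) - 1)
    print(mean, var)   # Remark 3.3 gives mean 0 and variance 1/12 for the limit
```

For the continuous case one would replace `c_discrete` by $C(\tau, T, T')$ with two independent exp(2) draws, and add the root branch afterwards as in line 13 of Alg. 2.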
Algorithm 3 Recursive approximation
1: procedure Yrecursion($n$, $Y_0$)
2:   if $n = 0$ then
3:     $Y_1 = Y_0$, $Y_2 = Y_0$
4:   else if $n = 1$ and $Y_0 = 0$ then
5:     Draw $\tau_1$, $\tau_2$ independently from $F_\tau$
6:     Draw $\theta_1$, $\theta_2$ independently from $F_\theta$
7:     $Y_1 = C(\tau_1, \theta_1)$
8:     $Y_2 = C(\tau_2, \theta_2)$
9:   else
10:     $Y_1$ = Yrecursion($n-1$, $Y_0$)
11:     $Y_2$ = Yrecursion($n-1$, $Y_0$)
12:   end if
13:   Draw $\tau$ from $F_\tau$
14:   Draw $\theta$ from $F_\theta$
15:   return $g_1(\tau)Y_1 + g_2(\tau)Y_2 + C(\tau, \theta)$
16: end procedure
17: return Yrecursion($N$, $Y_0$)
18: ▷ Add root branch (exp(1) or 1) if needed.

times. We compare the quantiles from the different distributions. We can see that the approximation of $W_n$ by $\overline{W}_n$ is a good one and can be used when one needs to work with the distribution of the cophenetic index with branch lengths. In the case of the discrete cophenetic index we have found an exact limit distribution which is a contraction-type distribution. Therefore, one can relatively quickly simulate a sample from it without the need to do lengthy simulations of the whole tree and then calculations of the cophenetic index.

Unfortunately, this contraction approach does not seem to give such good results in the Yule tree with branch lengths case. We used an approximation when constructing the contraction. Instead of taking the law of the length of two daughter branches, we took the law of a random internal branch. This induces a difference between the tails of the distributions that is clearly visible in the simulations. Even at the second moment level there is a difference. We calculated (Thm. 6.8) that $\mathrm{Var}[W_n] \to 2\pi^2/9 - 1 \approx 1.193$, $\mathrm{Var}[\overline{W}_n] \to 4\pi^2/3 - 12 \approx 1.159$, while $\mathrm{Var}[2Y^{(n)} + T_1] = 1.25$. Therefore, the approximation by $\overline{W}_n$ seems better already at the second moment level. Generally, if one cannot afford the time and memory to simulate a large sample of Yule trees, simulating $\overline{W}_n$ values seems an attractive option, as the discrepancy between the two distributions seems small. In Fig.
2 we compare the density estimates of (scaled and centred) both continuous and discrete branches cophenetic indices and their respective contraction-type limit distributions. The density estimates generally agree, but we know from Tab. 1 that for $\Phi^{(n)}$ this is only an approximation. We simulated 10000 Yule trees and hence we report only the quantiles between 2.5% and 97.5%. Quantiles further out in the tails seemed less accurate and hence are not included in the table. Similarly, we can see less correspondence between the different estimates of kurtosis. This statistic relies on fourth moments and hence is more sensitive to the tails. On the other hand, we can see much greater Monte Carlo error for the kurtosis in all simulations, including the setup where the values are extracted directly from Yule trees. The values for $\tilde{\Phi}^{(n)}$ seem more similar to values from the Yule tree. We should expect this as here we have shown an exact limit distribution.

An overall assessment of the quantiles is given by the root-mean-square-error (RMSE) row in Tab. 1. We consider the quantiles at the $\alpha_1 = 0.001$, $\alpha_2 = 0.005$, $\alpha_3 = 0.01$, $\alpha_4 = 0.025$, $\alpha_5 = 0.05$, $\alpha_6 = 0.95$, $\alpha_7 = 0.975$, $\alpha_8 = 0.99$, $\alpha_9 = 0.995$, $\alpha_{10} = 0.999$ levels. The RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{100}\sum_{j=1}^{100}\left(\sum_{i=1}^{5}(\alpha_i - \alpha_{i-1})\left(\hat{q}_j(\alpha_i) - q(\alpha_i)\right)^2 + \sum_{i=6}^{10}(\alpha_{i+1} - \alpha_i)\left(\hat{q}_j(\alpha_i) - q(\alpha_i)\right)^2\right)(0.1)^{-1}} \tag{15}$$

with dummy levels $\alpha_0 = 0$ and $\alpha_{11} = 1$. The $(0.1)^{-1}$ normalizes the whole mean-square-error. We only look at the error at the tails, so we correct by the fraction of the distributions' support that we consider. As a proxy for the true quantiles we take the pooled values (as explained in Tab. 1) from the "Yule columns". The $j$ index runs over the 100 repeats of the simulations.

The RMSE, when using $\overline{W}_n$, seems to be on the level of the RMSE of the "direct simulations". $Y^{(N)}$ has an error of about twice the size (both simulation methods). Looking at $\tilde{Y}^{(N)}$ one can see that the RMSE is exactly on the level of the "Yule column's" RMSE.
This is even though we used a recursion of level 10, while an exact match of distributions should take place in the limit (infinite depth recursion). However, the rapid, exponential convergence of the contraction seems to make any differences invisible already at this recursion level.

Continuous index: $(\Phi^{(n)} - E[\Phi^{(n)}])/\sqrt{\mathrm{Var}[\Phi^{(n)}]}$ and its limit approximations

| | Yule | $N(0,1)$ | $\overline{W}_n$ | $Y^{(N)}$ Alg. 2 | $Y^{(N)}$ Alg. 3 |
|---|---|---|---|---|---|
| Run time | 690.918s | - | 3.905s | 0.318s | 110.021s |
| Avg. (= 0), min, max | −0.023, 0.029 | 0 | −0.025, 0.024 | −0.024, 0.026 | −0.019, 0.026 |
| Avg., pooled | 0.002 | 0 | 0.001 | 0.006 | 0 |
| Var. (= 1), min, max | 0.946, 1.074 | 1 | 0.921, 1.025 | 0.928, 1.072 | 0.932, 1.061 |
| Var., pooled | 1.003 | 1 | 0.97 | 1.014 | 1 |
| Skew., min, max | 1.480, 1.834 | 0 | 1.487, 1.917 | 1.67, 2.124 | 1.62, 2.197 |
| Skew., pooled | 1.643 | 0 | 1.68 | 1.97 | 1.858 |
| Ex. kurt., min, max | 3.123, 7.222 | 0 | 3.148, 6.753 | 3.690, 8.31 | 3.5, 9.428 |
| Ex. kurt., pooled | 4.639 | 0 | 4.575 | 6.377 | 5.435 |
| q(0.025), min, max | −1.235, −1.194 | −1.96 | −1.206, −1.174 | −1.115, −1.087 | −1.114, −1.091 |
| q(0.025), pooled | −1.215 | −1.96 | −1.19 | −1.1 | −1.101 |
| q(0.05), min, max | −1.15, −1.104 | −1.644 | −1.115, −1.085 | −1.048, −1.023 | −1.047, −1.024 |
| q(0.05), pooled | −1.123 | −1.644 | −1.1 | −1.033 | −1.036 |
| q(0.95), min, max | 1.861, 2.07 | 1.644 | 1.82, 2.013 | 1.863, 2.081 | 1.873, 2.066 |
| q(0.95), pooled | 1.946 | 1.644 | 1.914 | 1.942 | 1.969 |
| q(0.975), min, max | 2.436, 2.735 | 1.96 | 2.436, 2.732 | 2.434, 2.823 | 2.536, 2.792 |
| q(0.975), pooled | 2.587 | 1.96 | 2.549 | 2.642 | 2.634 |
| RMSE | 0.053 | 0.762 | 0.062 | 0.11 | 0.108 |

Discrete index: $(\tilde{\Phi}^{(n)} - E[\tilde{\Phi}^{(n)}])/\sqrt{\mathrm{Var}[\tilde{\Phi}^{(n)}]}$ and its limit approximations

| | Yule | $N(0,1)$ | $\tilde{Y}^{(N)}$ Alg. 2 | $\tilde{Y}^{(N)}$ Alg. 3 |
|---|---|---|---|---|
| Run time | 698.269s | - | 0.233s | 44.358s |
| Avg. (= 0), min, max | −0.02, 0.025 | 0 | −0.033, 0.032 | −0.028, 0.02 |
| Avg., pooled | 0 | 0 | 0.001 | 0 |
| Var. (= 1), min, max | 0.939, 1.038 | 1 | 0.953, 1.087 | 0.931, 1.047 |
| Var., pooled | 1 | 1 | 1.012 | 1.001 |
| Skew., min, max | 1.138, 1.368 | 0 | 1.163, 1.368 | 1.159, 1.352 |
| Skew., pooled | 1.245 | 0 | 1.25 | 1.253 |
| Ex. kurt., min, max | 1.374, 2.853 | 0 | 1.392, 2.707 | 1.481, 2.88 |
| Ex. kurt., pooled | 1.95 | 0 | 1.989 | 2 |
| q(0.025), min, max | −1.257, −1.226 | −1.96 | −1.276, −1.239 | −1.266, −1.23 |
| q(0.025), pooled | −1.245 | −1.96 | −1.257 | −1.246 |
| q(0.05), min, max | −1.18, −1.146 | −1.644 | −1.184, −1.154 | −1.175, −1.146 |
| q(0.05), pooled | −1.162 | −1.644 | −1.17 | −1.161 |
| q(0.95), min, max | 1.844, 2.024 | 1.644 | 1.883, 2.066 | 1.856, 2.034 |
| q(0.95), pooled | 1.949 | 1.644 | 1.958 | 1.949 |
| q(0.975), min, max | 2.328, 2.607 | 1.96 | 2.383, 2.645 | 2.360, 2.62 |
| q(0.975), pooled | 2.486 | 1.96 | 2.5 | 2.488 |
| RMSE | 0.040 | 0.663 | 0.048 | 0.040 |

Table 1: Simulations based on 100 independent repeats of 10000 independent draws of each random variable (population size for Alg. 2), i.e. columns, bar $N(0,1)$. The "min, max" rows give the minimum and the maximum observed over the 100 repeats, and the "pooled" rows the value obtained from pooling all repeats together. The running times are averages of 100 independent repeats with 10000 draws each. The abbreviations in the row names are for average (Avg.), variance (Var.), skewness (Skew.), excess kurtosis (Ex. kurt.) and root-mean-square-error (RMSE). The rows $q(\alpha)$ correspond to the simulated (bar $N(0,1)$) quantiles, i.e. for a random variable $X$, $P(X \le q(\alpha)) = \alpha$. All simulations were done in R with the package TreeSim (Stadler, 2009, 2011) used to obtain the Yule trees with speciation rate $\lambda = 1$, $n = 500$ tips and a root edge. The Yule tree $\Phi^{(n)}$, $\tilde{\Phi}^{(n)}$ values are centred and scaled by expectation and standard deviation from Eqs. (6) and (7). Other centrings and scalings are summarized in Tab. 2. $N = 10$ for Algs. 2 and 3 is the number of generations and recursion depth of the respective algorithm. In Alg. 2 the initial population is set at 0 and also $Y_0 = 0$ for Alg. 3. The simulations were run in R 3.4.2 for openSUSE 42.3 (x86_64) on a 3.50GHz Intel Xeon CPU E5-1620 v4. The calculation of the RMSE is described in the text next to Eq. (15).

| | $\overline{W}_n$ | $Y^{(N)}$ | $\tilde{Y}^{(N)}$ |
|---|---|---|---|
| Centring ($\mu$) | $2(n - H_{n,1})/(n-1)$ | 1 | 1 |
| Scaling ($\sigma$) | $\sqrt{(2\pi^2 - 9)/9}$ | $\sqrt{1 + 1/16}$ | $(\sqrt{12})^{-1}$ |

| | $\overline{W}^{NRE}_n$ | $Y^{(N)}_{NRE}$ | $\tilde{Y}^{(N)}_{NRE}$ |
|---|---|---|---|
| Centring ($\mu$) | $(n + 1 - 2H_{n,1})/(n-1)$ | 0 | 0 |
| Scaling ($\sigma$) | $\sqrt{(2\pi^2 - 18)/9}$ | $1/4$ | $(\sqrt{12})^{-1}$ |

Table 2: Centrings and scalings applied to obtain the entries in Tab. 1. For a random variable $X$, by its centred and scaled version we mean $(X - \mu)/\sigma$. These centrings and scalings are required to obtain mean zero, variance 1 versions of the random variables, i.e. so that they have the same location and scale as the z-transformed cophenetic index. In the case of $\overline{W}_n$ we take the asymptotic scaling (Thm. 6.8) to be comparable with $Y^{(N)}$. For the convenience of the reader we also provide corresponding centrings and scalings in the no root-edge setup (not considered in Tab. 1).

Figure 2: Density estimates of scaled (by theoretical standard deviation) and centred (by theoretical expectation) cophenetic indices (black) from 10000 simulated 500 tip Yule trees with $\lambda = 1$ and of simulation by Alg. 3 (gray), also scaled and centred to mean 0 and variance 1. Left: density estimates for $\Phi^{(n)}$, right: $\tilde{\Phi}^{(n)}$. The curves are calculated by R's density() function.

4.2 Power of the tests

For a given test statistic to be useful one also needs to know its power, the ability to reject the null hypothesis (here Yule tree) when a given alternative one is true. For example, balance indices based only on topology like Sackin's, Colless' or $\tilde{\Phi}^{(n)}$ cannot be expected to differentiate between any trees that are generated by different constant rate birth-death processes or by the coalescent. The rationale behind this is that the topologies induced by the $n$ contemporary species (i.e. we forget about lineages leading to extinct ones) are stochastically indistinguishable no matter what the death or birth rate is (Thm. 2.3, Cor. 2.4, Gernhard, 2008). Similarly, regarding the coalescent, at the bottom of their p. 93 Steel and McKenzie (2001) write "..., one has the coalescent model [1,18,19]. In this model one starts with n objects, then picks two at random to coalesce, giving n − 1 objects. This process is repeated until there is only a single object left. If this process is reversed, starting with one object to give n objects, then it is equivalent to the Yule model. Note that in the coalescent model there is commonly a probability distribution for the times of coalescences, but in the Yule model we ignore this element." To differentiate between such trees one needs to take into consideration the branch lengths. Here we compare the power of the Sackin's, Colless', $\tilde{\Phi}^{(n)}$ and $\Phi^{(n)}$ indices at the 5% significance level. The null hypothesis is always that the tree is generated by a pure-birth process with rate $\lambda = 1$.
The alternative ones are birth-death processes ($\lambda = 1$, death rate $\mu = 0.25$ and $0.5$, using the TreeSim package), the coalescent process (ape's rcoal() function, Paradis et al., 2004) and the biased speciation model for

$$p \in \{0.05, 0.1, 0.125, 0.15, 0.18, 0.2, 0.25, 0.4, 0.5\}.$$

We also simulate a pure-birth process to check if the significance level is met. All trees were simulated with an exp(1) root edge. The so-called biased speciation model with parameter $p$ is the tree growth model as described by Blum and François (2005). In their words, "Assume that the speciation rate of a specific lineage is equal to r (0 ≤ r ≤ 1). When a species with speciation rate r splits, one of its descendant species is given the rate pr and the other is given the speciation rate (1 − p)r where p is fixed for the entire tree. These rates are effective until the daughter species themselves speciate. Values of p close to 0 or 1 yield very imbalanced trees while values around 0.5 lead to over-balanced phylogenies." We simulated such trees with in-house R code.

The quantiles of Sackin's and Colless' indices were obtained using Alg. 3. It is known (Eqs. 2 and 3, Blum and François, 2005; Blum et al., 2006) that after normalization (centring by expectation and dividing by $n$) in the limit they satisfy a contraction-type distribution of the form of Eq. (14), i.e.

$$Y \stackrel{D}{=} \tau Y' + (1-\tau)Y'' + C(\tau)$$

for $\tau \sim \mathrm{Unif}[0,1]$. The function $C(\tau)$ takes the form

$$C(\tau) = 2\tau\log\tau + 2(1-\tau)\log(1-\tau) + 1$$

in Sackin's case and

$$C(\tau) = \tau\log\tau + (1-\tau)\log(1-\tau) + 1 - 2\min(\tau, 1-\tau)$$

in Colless' case. In particular, studying the limit of Sackin's index is equivalent to studying the Quicksort distribution (Blum and François, 2005). We can immediately see the main qualitative difference: the limit of the normalized cophenetic index has the squares $\tau^2$, $(1-\tau)^2$ in the "recursion part", while Sackin's and Colless' have $\tau$, $(1-\tau)$. Using 10000 repeats of Alg.
3 with recursion depth 10 we obtained the following sets of quantiles: $q(0.025) = -0.983$, $q(0.95) = 1.189$, $q(0.975) = 1.493$, and $q(0.025) = -1.354$, $q(0.95) = 1.494$, $q(0.975) = 1.868$, respectively for the normalized Sackin's and Colless' indices.

Under each model we simulated 10000 trees conditioned on 500 contemporary tips. We then checked if the tree was outside the 95% "Yule zone" (Yang et al., 2017) by the procedure described in Alg. 1. We calculated the normalized Sackin's, Colless', discrete and continuous cophenetic indices (normalizations from Thm. 2.12). The functions sackin.phylo() and colless.phylo() of the phyloTop (Kendall et al., 2016) R package were used, while the two cophenetic indices were calculated using a linear time in-house R implementation based on traversing the tree (Cor. 3, Mir et al., 2013). Two tests were considered, a two-sided one and a right-tailed one. For the discrete cophenetic index the quantiles from the simulation by Alg. 3 were considered, for the continuous one those from $\overline{W}_n$ (Tab. 1). The power is then estimated as the fraction of times the null hypothesis was rejected and represented in Tab. 3 by the corresponding Type II error rates. For the Yule tree simulation we can see that the significance level is met. All simulated trees are independent of the trees used to obtain the values in Tab. 1 and the quantiles of Sackin's and Colless' indices. Hence, they offer a validation of the rejection regions. We summarize the power study in Tab. 3.

As indicated in Alg. 1 one should first "correct" the tree for the speciation rate when using the cophenetic index with branch lengths. The distributional results derived here on the cophenetic indices are for a unit speciation rate Yule tree. From a mathematical perspective this is not a significant restriction. If one has a pure-birth tree generated by a process with speciation rate $\lambda \neq 1$, then multiplying all branch lengths by $\lambda$ will make the tree equivalent to one with unit rate.
Hence, all the results presented here are general up to a multiplicative constant. However, from an applied perspective the situation cannot be treated so lightly. For example, if we used the cophenetic index with branch lengths from a Yule tree with a very large speciation rate, then we would expect a significant deviation. However, unless one is interested in deviations from the unit speciation rate Yule tree, this would not be useful. Hence, one needs to correct for this effect. If the tree did come from a Yule process, then an estimate, $\hat{\lambda}$, of the speciation rate by maximum likelihood can be obtained. For example, in the work here we used ape's yule() function. Then, one multiplies all branch lengths in the tree by $\hat{\lambda}$ and calculates the cophenetic index for this transformed tree. It is important to point out that $\hat{\lambda}$ is only an estimate and hence a random variable. The effects of this source of randomness on the limit distribution deserve a separate, detailed study. Balance indices that do not use branch lengths do not suffer from this issue but on the other hand miss another aspect of the tree: proportions between branch lengths that are non-Yule like.

| Model | Sackin's > | Sackin's 2 | Colless' > | Colless' 2 | $\tilde{\Phi}^{(n)}$ > | $\tilde{\Phi}^{(n)}$ 2 | $\Phi^{(n)}$ > | $\Phi^{(n),c}$ > | $\Phi^{(n)}$ 2 | $\Phi^{(n),c}$ 2 | mean($\hat{\lambda}$) | variance($\hat{\lambda}$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yule | 0.952 | 0.952 | 0.955 | 0.955 | 0.953 | 0.952 | 0.949 | 0.95 | 0.944 | 0.944 | 1 | 0.002 |
| Coalescent | 0.955 | 0.954 | 0.956 | 0.959 | 0.952 | 0.955 | 0.936 | 0 | 0.881 | 0 | 37.836 | 42.06 |
| birth-death $\mu = 0.25$ | 0.952 | 0.953 | 0.956 | 0.955 | 0.948 | 0.952 | 0.853 | 0.874 | 0.903 | 0.91 | 0.87 | 0.002 |
| birth-death $\mu = 0.5$ | 0.95 | 0.95 | 0.952 | 0.955 | 0.951 | 0.953 | 0.635 | 0.729 | 0.739 | 0.808 | 0.722 | 0.001 |
| biased speciation $p = 0.05$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.98 | 0 | 0.542 | 0.004 | $4.213 \cdot 10^{\ldots}$ |
| biased speciation $p = 0.1$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.982 | 0 | 0.521 | 0.004 | $4.241 \cdot 10^{\ldots}$ |
| biased speciation $p = 0.125$ | 0 | 0 | 0 | 0 | 0.016 | 0.431 | 0 | 0.981 | 0 | 0.522 | 0.004 | $4.211 \cdot 10^{\ldots}$ |
| biased speciation $p = 0.18$ | 1 | 1 | 0.497 | 0.959 | 1 | 1 | 0 | 0.981 | 0 | 0.524 | 0.004 | $4.191 \cdot 10^{\ldots}$ |
| biased speciation $p = 0.2$ | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.982 | 0 | 0.508 | 0.004 | $4.222 \cdot 10^{\ldots}$ |
| biased speciation $p = 0.25$ | 1 | 0.834 | 1 | 1 | 1 | 1 | 0 | 0.98 | 0 | 0.515 | 0.004 | $4.243 \cdot 10^{\ldots}$ |
| biased speciation $p = 0.4$ | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0.982 | 0.001 | 0.51 | 0.004 | $4.218 \cdot 10^{\ldots}$ |
| biased speciation $p = 0.5$ | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0.983 | 0.002 | 0.509 | 0.004 | $4.335 \cdot 10^{\ldots}$ |

Table 3: Power, presented as Type II error rates, of the various indices to detect deviations from the Yule tree for various alternative models at the 5% significance level. In the first row the trees were simulated under the Yule model (i.e. the entries are one minus the Type I error rate), so this is a confirmation of the correct significance level. Each probability is the fraction of 10000 independently simulated trees that were accepted as Yule trees by the various tests. Columns with the ">" label indicate the right-tailed test and with the label "2" the two-sided test. The critical regions for the cophenetic indices were taken from the pooled estimates in Tab. 1. The superscript $c$ indicates tests where the trees were corrected for the speciation rate through multiplying all branch lengths by $\hat{\lambda}$. The mean and variance over all trees of $\hat{\lambda}$, as obtained through ape's yule() function, are reported. Each tree's branches were scaled by its particular $\hat{\lambda}$ estimate.

The power analysis presented in Tab. 3 generally agrees with intuition and the power analysis done by Blum and François (2005). The first row shows that for all tests and statistics the 5% significance level is approximately kept. Then, in the next three rows (coalescent and birth-death process) all topology based indices fail completely (the power is at the significance level). This is completely unsurprising, as after one removes all speciation events (with lineages) leading to extinct species from a birth-death tree, the remaining tree is topologically equivalent to a pure birth tree. The same is true for the coalescent model: its topology is identical in law to the Yule tree's one.
The cophenetic index with branch lengths has a high Type II error rate but is still better than the topological indices. However, when one "corrects for $\lambda$" this index manages to nearly perfectly reject the coalescent model trees (2 trees were not rejected by the two-sided test).

Power for the biased speciation model follows the same pattern as Blum and François (2005) observed. When imbalance is evident, $p \le 0.125$, all ($\lambda$-uncorrected) tests were nearly perfect (the two-sided discrete cophenetic one is an exception). However, the $\lambda$ correction significantly worsened the ability of $\Phi^{(n)}$ to detect deviations. As imbalance decreased so did the power of the topological indices. For overbalanced trees one-sided tests failed, two-sided ones worked (just as Blum and François, 2005, observed). The cophenetic index with branch lengths (without correction), which does not consider only the topology, was able to successfully reject the pure-birth tree for all $p$ (with only minimal Type II error for $p \ge 0.4$ in the two-sided test case). Interestingly, the power of $\Phi^{(n)}$ (both corrected and uncorrected) seems invariant with respect to $p$. These results are especially promising as $\Phi^{(n)}$ seems to be an index that functions significantly better in the difficult, $0.18 \le p \le 0.25$, regime, even after correcting.

At this stage we can point out that a normal approximation to the cophenetic indices' limit distribution is not appropriate. When doing the above power study we observed that, when using the quantiles of the standard normal distribution, the right-tailed tests based on the two cophenetic indices reject 6.81% and 7.03% of Yule trees, while the corresponding two-sided tests reject 4.87% and 4.66% of Yule trees. The Type I error rates of the two-sided tests are within the observed Monte Carlo errors (in Tab. 3) but the right-tailed tests' Type I error rates are evidently inflated. This confirms that the right tail of the scaled cophenetic index is much heavier than normal.
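The decision rule of Alg. 1 that underlies the power study, with simulated rather than normal quantiles, can be sketched as follows for the discrete index with a root edge. This is an illustration only: the moments are those of Eq. (7), while the quantile convention in `empirical_quantile` and all function names are assumptions of this example and need not match R's quantile() or the code used for Tab. 3.

```python
from math import comb, sqrt


def discrete_moments(n):
    """Mean and variance of the discrete cophenetic index with root edge, Eq. (7)."""
    h1 = sum(1.0 / i for i in range(1, n + 1))
    h2 = sum(1.0 / i ** 2 for i in range(1, n + 1))
    mean = comb(n, 2) * (4.0 * (n - h1) / (n - 1) - 1.0)
    var = (n ** 4 - 10 * n ** 3 + 131 * n ** 2 - 2 * n) / 12.0 \
        - 4 * n ** 2 * h2 - 6 * n * h1
    return mean, var


def standardize(phi, n):
    """Step 5 of Alg. 1 for the discrete index (needs n >= 4 for Var > 0)."""
    mean, var = discrete_moments(n)
    return (phi - mean) / sqrt(var)


def empirical_quantile(sample, level):
    """A simple order-statistic quantile (one of several possible conventions)."""
    ordered = sorted(sample)
    index = min(len(ordered) - 1, max(0, int(level * len(ordered))))
    return ordered[index]


def reject_yule(x, null_sample, alpha=0.05, kind="two-sided"):
    """Steps 6-7 of Alg. 1: True if x falls inside the rejection region.

    null_sample holds standardized draws under the Yule model, e.g. obtained
    from simulated trees or from the contraction-type limit distribution.
    """
    if kind == "right":
        return x > empirical_quantile(null_sample, 1.0 - alpha)
    if kind == "left":
        return x < empirical_quantile(null_sample, alpha)
    lower = empirical_quantile(null_sample, alpha / 2.0)
    upper = empirical_quantile(null_sample, 1.0 - alpha / 2.0)
    return x < lower or x > upper
```

Because the null distribution is right-skewed, a value can be rejected by the two-sided test through the lower tail while the right-tailed test accepts it, which is the pattern seen for over-balanced trees.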
In short, the power study indicates that the cophenetic index with branch lengths should be considered as an option to detect deviations from the Yule tree. This is because it is able to use information from two sources: the topology and time (a needed direction of development, as indicated in Ch. 33 of Felsenstein, 2004). Actually, this is evident in the decomposition of Eq. (3). The $V^{(n)}$s describe the topology and the $Z$s the branch lengths. With more information a more powerful testing procedure is possible. Deviations that are not topologically visible, e.g. biased speciation in the $0.18 \le p \le 0.25$ regimes, are now detectable. To use $\Phi^{(n)}$ one should correct for the effects of the speciation rate, as otherwise one merely detects deviations from the unit rate Yule tree. This correction is a mixed blessing. It can help or hinder detection.

4.3 Examples with empirical phylogenies

It is naturally interesting to ask how the indices behave for phylogenies estimated from sequence data. Comparing a database of phylogenies, like TreeBase (http://www.treebase.org), with yet another index's distribution under the Yule model should not be expected to yield interesting results. The Yule model has been indicated as inadequate to describe the collection of TreeBase's trees (e.g. Blum and François, 2007). Therefore, we choose a particular study that estimated a tree and also reported a collection of posterior trees. Sosa et al. (2016a) is a recent work, providing all trees from BEAST's (Drummond and Rambaut, 2007) output, well suited for such a purpose. Sosa et al. (2016a) estimate the evolutionary relationships between a set of 109 tree fern species. They report a posterior set of 22498 phylogenies (Sosa et al., 2016b).

In Tab. 4 we look at what percentage of the trees from the posterior was accepted as being consistent with the Yule tree by the various tests and indices. It can be seen that the discrete cophenetic index has a high acceptance rate.
The continuous one, which also takes into account branch lengths, did not accept a single tree. However, this is lost when one corrects for the speciation rate (first ape's multi2di() was used to make the trees binary ones). Most tests and indices rejected the Yule tree for the maximum likelihood estimate of the phylogeny, with some exceptions. The two-sided discrete cophenetic index test did not reject the null hypothesis of the pure-birth tree. Also after correcting for the speciation rate (estimated at $\lambda = 0.023$), neither test based on the continuous cophenetic index rejected the Yule tree. Therefore, one should conclude (based on the "topological balances") that the Yule tree null hypothesis can be rejected for this clade of plants.

Index                  right-tailed (>)   two-sided (2)
Sackin's               0.03               0.109
Colless'               0.019              0.062
$\tilde{\Phi}^{(n)}$   0.385              0.559
$\Phi^{(n)}$           0                  0
$\Phi^{(n),c}$         0.974              0.995

Table 4: Percentage of trees from Sosa et al. (2016b)'s set of posterior trees accepted as Yule trees by the various tests and indices. The label ">" indicates the right-tailed test and the label "2" the two-sided test. The critical regions for the cophenetic indices were taken from the pooled estimates in Tab. 1. The superscript c indicates tests where the trees were corrected for the speciation rate through multiplying all branch lengths by $\lambda$. Ape's yule() function returned an average (over all trees) estimate of $\lambda$ of 0.023 with variance $2.988 \cdot 10^{[\ldots]}$. Each tree's branches were scaled by its particular $\lambda$ estimate. Each tree was first transformed by ape's multi2di() into a binary one.

We also followed Blum and François (2005) in looking at Yusim et al. (2001)'s phylogeny of the human immunodeficiency virus type 1 (HIV-1) group M gene sequences, available in the ape R package. The phylogeny consists of 193 tips and Blum and François (2005) could not reject the null hypothesis of the pure-birth tree (using Sackin's index amongst others).
After pruning the tree to keep "only the old internal branches that corresponded to the 30 oldest ancestors" they were able to reject the Yule tree. They conclude that the "results probably indicate a change in the evolutionary rate during the evolution which had more impact on cladogenesis during the early expansion of the virus." Repeating their experiment we find that only the two versions of the cophenetic index point to a deviation, and only in the two-sided test (see Tab. 5). Based only on the $\tilde{\Phi}^{(n)}$'s test, and given that it conflicts with the conclusions of Sackin's and Colless', one should not draw any conclusions. However, as $\Phi^{(n)}$'s test indicates a deviation, we can be inclined to reject the null hypothesis of the Yule tree. This is further strengthened by the fact that the significance remains after the correction for $\lambda$. Even though the topology as a whole seems consistent with the pure-birth tree, the branch lengths are not. The fact that only the two-sided test rejected the Yule tree indicates that the HIV phylogeny is over-balanced in comparison to a pure-birth tree. In fact, in the biased speciation model tree over-balance is observed for values of $p$ close to 0.5 (Blum and François, 2005). Such trees have a declining speciation rate as they grow and hence this supports Blum and François (2005)'s aforementioned explanation.

Sackin's   Colless'   [labels of the remaining columns and the significance markers are not recoverable from the extracted text]
0.823      0.993      1.689   1.765   1.602   9.313

Table 5: Values of the normalized indices for Yusim et al. (2001)'s HIV-1 phylogeny. Above each index is an indication of whether the index deviates at the 5% significance level from the Yule tree: dash insignificant, asterisk significant. The first symbol concerns the right-tailed test, the second the two-sided test. The index with superscripted $\lambda$ is calculated from the tree corrected for the speciation rate by multiplying all branch lengths by $\lambda$.
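The way the right-tailed and two-sided decisions above can disagree is easy to see in a minimal sketch of a quantile-based test. This is an illustration only: the function names `empirical_quantile` and `yule_test` are hypothetical, and the null sample is assumed to hold values of a normalized index simulated under the Yule model (for instance by one of the heuristics proposed in this work).

```python
import math


def empirical_quantile(sorted_sample, q):
    """Smallest order statistic with at least a fraction q of the sample at or below it."""
    index = max(math.ceil(q * len(sorted_sample) - 1e-12) - 1, 0)
    return sorted_sample[index]


def yule_test(observed, null_sample, level=0.05):
    """Right-tailed and two-sided decisions based on a simulated null sample."""
    ordered = sorted(null_sample)
    upper_one_sided = empirical_quantile(ordered, 1.0 - level)
    lower = empirical_quantile(ordered, level / 2.0)
    upper = empirical_quantile(ordered, 1.0 - level / 2.0)
    return {
        "right_tailed": observed > upper_one_sided,
        "two_sided": observed < lower or observed > upper,
    }
```

An observed value in the far left tail is rejected only by the two-sided test, which is the pattern interpreted above as over-balance; a value just beyond the 95% quantile is rejected only by the right-tailed test.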
5 Almost sure behaviour of the cophenetic index

We study the asymptotic distributional properties of $\Phi^{(n)}$ for the pure-birth tree model using techniques from our previous papers on branching Brownian and Ornstein-Uhlenbeck processes (Bartoszek, 2014; Bartoszek and Sagitov, 2015a,b; Sagitov and Bartoszek, 2012). We assume that the speciation rate of the tree is $\lambda = 1$. The key property we will use is that in the pure-birth tree case the time between two speciation events, $k$ and $k+1$ (the first speciation event is at the root), is $\exp(k)$ distributed, as the minimum of $k$ $\exp(1)$ random variables. We furthermore assume that the tree starts with a single species (the origin) that lives for $\exp(1)$ time and then splits (the root of the tree) into two species. We consider a tree conditioned on $n$ contemporary species. This conditioning translates into stopping the tree process just before the $n+1$ speciation event, i.e. the last interspeciation time is $\exp(n)$ distributed. We introduce the notation that $U^{(n)}$ is the height of the tree, $\tau^{(n)}$ is the time to coalescent of two randomly selected tip species and $T_k$ is the time between speciation events $k$ and $k+1$ (see Fig. 3 and Bartoszek and Sagitov, 2015b; Sagitov and Bartoszek, 2012).

Figure 3: A pure-birth tree with the various time components marked on it. The between speciation times on this lineage are $T_1$, $T_2$, $T_3 + T_4$ and $T_5$. If we "randomly sample" the pair of extant species "A" and "B", then the two nodes coalesced at time $\tau^{(n)}$.

Theorem 5.1 The cophenetic index is an increasing sequence of random variables, $\Phi^{(n+1)} > \Phi^{(n)}$, and has the recursive representation
\[
\Phi^{(n+1)} = \Phi^{(n)} + nU^{(n)} - \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i} \tau_{ij}^{(n)}, \tag{16}
\]
where $\xi_i^{(n)}$ is an indicator random variable of whether tip $i$ split at the $n$-th speciation event.

Proof From the definition we can see that
\[
\Phi^{(n)} = \sum_{1 \le i < j \le n} \left( U^{(n)} - \tau_{ij}^{(n)} \right) = \binom{n}{2} \left( U^{(n)} - \operatorname{E}\left[ \tau^{(n)} \mid \mathcal{Y}_n \right] \right),
\]
where $\tau_{ij}^{(n)}$ is the time to coalescent of tip species $i$ and $j$.
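We now develop a recursive representation for the cophenetic index. First notice that when a new speciation occurs all coalescent times are extended by $T_{n+1}$, i.e.
\[
\sum_{1 \le i < j \le n+1} \tau_{ij}^{(n+1)} = \sum_{1 \le i < j \le n} \left( \tau_{ij}^{(n)} + T_{n+1} \right) + \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i} \left( \tau_{ij}^{(n)} + T_{n+1} \right) + T_{n+1},
\]
where the "lone" $T_{n+1}$ is the time to coalescent of the two descendants of the split tip. The vector $\left( \xi_1^{(n)}, \ldots, \xi_n^{(n)} \right)$ consists of $n-1$ 0s and exactly one 1 (a categorical distribution with $n$ categories all with equal probability). For each $i$ the marginal probability that $\xi_i^{(n)}$ is 1 is $1/n$. We rewrite
\[
\sum_{1 \le i < j \le n+1} \tau_{ij}^{(n+1)} = \sum_{1 \le i < j \le n} \tau_{ij}^{(n)} + \binom{n+1}{2} T_{n+1} + \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i} \tau_{ij}^{(n)}
\]
and then obtain the recursive form
\[
\begin{aligned}
\Phi^{(n+1)} &= \binom{n+1}{2} \left( U^{(n)} + T_{n+1} \right) - \binom{n+1}{2} T_{n+1} - \sum_{1 \le i < j \le n} \tau_{ij}^{(n)} - \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i} \tau_{ij}^{(n)} \\
&= \binom{n+1}{2} U^{(n)} - \sum_{1 \le i < j \le n} \tau_{ij}^{(n)} - \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i} \tau_{ij}^{(n)} \\
&= \Phi^{(n)} + nU^{(n)} - \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i} \tau_{ij}^{(n)}.
\end{aligned}
\]
Obviously, $\Phi^{(n+1)} > \Phi^{(n)}$.

Proof of Theorem 2.4. Obviously
\[
W_{n+1} = \frac{n-1}{n+1} W_n + \frac{2}{n+1} U^{(n)} - \binom{n+1}{2}^{-1} \sum_{i=1}^{n} \xi_i^{(n)} \sum_{j \neq i} \tau_{ij}^{(n)}
\]
and
\[
\begin{aligned}
\operatorname{E}\left[ W_{n+1} \mid \mathcal{Y}_n \right] &= \frac{n-1}{n+1} W_n + \binom{n+1}{2}^{-1} \left( nU^{(n)} - \frac{1}{n} \sum_{i=1}^{n} \sum_{j \neq i} \tau_{ij}^{(n)} \right) \\
&= \frac{n-1}{n+1} W_n + \binom{n+1}{2}^{-1} \left( nU^{(n)} - \frac{2}{n} \sum_{i<j} \tau_{ij}^{(n)} \right) \\
&= \frac{n-1}{n+1} W_n + \binom{n+1}{2}^{-1} \left( nU^{(n)} - \frac{2}{n} \binom{n}{2} U^{(n)} + \frac{2}{n} \binom{n}{2} W_n \right) \\
&= \frac{(n-1)(n+2)}{n(n+1)} W_n + \binom{n+1}{2}^{-1} U^{(n)} \\
&= W_n + \binom{n+1}{2}^{-1} \left( U^{(n)} - W_n \right) = W_n + \binom{n+1}{2}^{-1} \binom{n}{2}^{-1} \left( \binom{n}{2} U^{(n)} - \Phi^{(n)} \right) \\
&= W_n + \binom{n+1}{2}^{-1} \left( U^{(n)} - \left( U^{(n)} - \operatorname{E}\left[ \tau^{(n)} \mid \mathcal{Y}_n \right] \right) \right) > W_n.
\end{aligned}
\]
Hence, $W_n$ is a positive submartingale with respect to $\mathcal{Y}_n$. Notice that
\[
\operatorname{E}\left[ W_n^2 \right] = \operatorname{E}\left[ \left( U^{(n)} - \operatorname{E}\left[ \tau^{(n)} \mid \mathcal{Y}_n \right] \right)^2 \right] \le \operatorname{E}\left[ \left( U^{(n)} - \tau^{(n)} \right)^2 \right].
\]
Then, using the general formula for the moments of $U^{(n)} - \tau^{(n)}$ (Appendix A, Bartoszek and Sagitov, 2015b), we see that
\[
\begin{aligned}
\operatorname{E}\left[ \left( U^{(n)} - \tau^{(n)} \right)^2 \right] &= 2 \frac{n+1}{n-1} \sum_{j=1}^{n-1} \frac{1}{(j+1)(j+2)} \left( H_{j,2} + H_{j,1}^2 \right) \\
&= 2 \frac{n+1}{n-1} \left( \frac{n}{n+1} H_{n,2} - \frac{n}{n+1} + \sum_{j=1}^{n-1} \frac{H_{j,1}^2}{(j+1)(j+2)} \right) \nearrow \frac{2}{3} \pi^2.
\end{aligned}
\]
Hence, $\operatorname{E}[W_n]$ and $\operatorname{E}[W_n^2]$ are $O(1)$ and by the martingale convergence theorem $W_n$ converges almost surely and in $L^2$ to a random variable with finite first and second moments.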
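The recursion of Eq. (16) can be checked numerically against the definition of the cophenetic index. The sketch below is an illustration, not part of the original analysis. It assumes $\lambda = 1$, grows a Yule tree tip by tip while storing all pairwise coalescent times, and compares the value predicted by Eq. (16) with the index recomputed directly as the sum of $U^{(n)} - \tau_{ij}^{(n)}$. The function name `check_recursion` is hypothetical.

```python
import random
from itertools import combinations


def check_recursion(n_max, seed=1):
    """Grow a unit-rate Yule tree to n_max tips and compare Eq. (16) with the definition."""
    rng = random.Random(seed)
    height = rng.expovariate(1.0)        # U^(1): the origin lives for exp(1) time
    tau = [[0.0]]                        # tau[a][b]: time back to the ancestor of a and b
    phis = [0.0]                         # Phi^(1) = 0, there are no pairs yet
    max_gap = 0.0
    for n in range(1, n_max):
        i = rng.randrange(n)             # tip that splits at the n-th speciation event
        predicted = phis[-1] + n * height - sum(
            tau[i][j] for j in range(n) if j != i)
        t_new = rng.expovariate(n + 1)   # T_{n+1}: time spent with n + 1 lineages
        new_row = [tau[i][j] + t_new for j in range(n)]
        new_row[i] = t_new               # the two descendants of the split tip
        tau = [[tau[a][b] + (t_new if a != b else 0.0) for b in range(n)]
               + [new_row[a]] for a in range(n)]
        tau.append(new_row + [0.0])
        height += t_new
        direct = sum(height - tau[a][b]
                     for a, b in combinations(range(n + 1), 2))
        max_gap = max(max_gap, abs(direct - predicted))
        phis.append(direct)
    return phis, max_gap
```

The returned gap should be at the level of floating point rounding, and the returned sequence of index values should be strictly increasing, as stated in Theorem 5.1.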
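Corollary 5.2 $W_n$ has finite third moment and is $L^3$ convergent.

Proof We first recall that $W_n$ is positive. Using the general formula for the moments of $U^{(n)} - \tau^{(n)}$ again we see
\[
\begin{aligned}
\operatorname{E}\left[ \left( U^{(n)} - \operatorname{E}\left[ \tau^{(n)} \mid \mathcal{Y}_n \right] \right)^3 \right] &\le \operatorname{E}\left[ \left( U^{(n)} - \tau^{(n)} \right)^3 \right] \\
&= 2 \frac{n+1}{n-1} \sum_{j=1}^{n-1} \frac{1}{(j+1)(j+2)} \left( H_{j,1} + 3H_{j,1} + 3H_{j,2} + H_{j,3} \right) \\
&< 16 \frac{n+1}{n-1} \sum_{j=1}^{n-1} \frac{H_{j,1}}{(j+1)(j+2)} \\
&= 16 \frac{n+1}{n-1} \frac{n - H_{n,1}}{n+1} = 16 \frac{n - H_{n,1}}{n-1} \nearrow 16.
\end{aligned}
\]
This implies that $\operatorname{E}[W_n^3] = O(1)$ and hence $L^3$ convergence and finiteness of the third moment.

Remark 5.3 Notice that we (Appendix A, Bartoszek and Sagitov, 2015b) made a typo in the general formula for the cross moment
\[
\operatorname{E}\left[ \left( U^{(n)} - \tau^{(n)} \right)^m \left( \tau^{(n)} \right)^r \right].
\]
The $(-1)^{m+r}$ should not be there, it will cancel with the $(-1)^{m+r}$ from the derivative of the Laplace transform.

Proof of Theorem 2.7. We write $W_n$ as
\[
\begin{aligned}
W_n &= U^{(n)} - \operatorname{E}\left[ \tau^{(n)} \mid \mathcal{Y}_n \right] = \operatorname{E}\left[ U^{(n)} - \tau^{(n)} \mid \mathcal{Y}_n \right] = \operatorname{E}\left[ \sum_{k=1}^{n-1} \mathbf{1}_k^{(n)} \sum_{i=1}^{k} T_i \,\Big|\, \mathcal{Y}_n \right] \\
&= \operatorname{E}\left[ \sum_{i=1}^{n-1} T_i \sum_{k=i}^{n-1} \mathbf{1}_k^{(n)} \,\Big|\, \mathcal{Y}_n \right] = \sum_{i=1}^{n-1} T_i \operatorname{E}\left[ \sum_{k=i}^{n-1} \mathbf{1}_k^{(n)} \,\Big|\, \mathcal{Y}_n \right] \\
&= \sum_{i=1}^{n-1} \left( \frac{1}{i} \sum_{k=i}^{n-1} \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right) Z_i = \sum_{i=1}^{n-1} V_i^{(n)} Z_i,
\end{aligned}
\]
where $Z_1, \ldots, Z_{n-1}$ are i.i.d. $\exp(1)$ random variables.

Remark 5.4 We notice that we may equivalently rewrite
\[
W_n = \sum_{k=1}^{n-1} \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \left( \sum_{i=1}^{k} T_i \right) = \sum_{k=1}^{n-1} \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \left( \sum_{i=1}^{k} \frac{1}{i} Z_i \right). \tag{17}
\]
The above and Eq. (3) are very elegant representations of the cophenetic index with branch lengths. They explicitly describe the way the cophenetic index is constructed from a given tree.

Proof of Theorem 2.11. The argumentation is analogous to the proof of Thm. 2.4 by using the recursion
\[
\tilde{\Phi}^{(n+1)} = \tilde{\Phi}^{(n)} + \sum_{i=1}^{n} \xi_i^{(n)} \left( \sum_{j \neq i} \tilde{\phi}_{ij}^{(n)} + \Upsilon_i^{(n)} \right),
\]
where $\Upsilon_i^{(n)}$ is the number of nodes on the path from the root (or appropriately origin) of the tree to tip $i$ (see also Bartoszek, 2014, esp. Fig. A.8).

An alternative proof for almost sure convergence can be found in Section 7.1.

6 Second order properties

In this Section we prove a series of rather technical Lemmata and Theorems concerning the second order properties of $\mathbf{1}_k^{(n)}$, $V_i^{(n)}$ and $W_n$.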
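Even though we will not obtain any weak limit, the derived properties do give insight on the delicate behaviour of $W_n$ and also show that no "simple" limit, e.g. Eq. (4), is possible. To obtain our results we used Mathematica 9.0 for Linux x86 (64-bit) running on Ubuntu 12.04.5 LTS to evaluate the required sums in closed forms. The Mathematica code is available as an appendix to this paper.

Lemma 6.1
\[
\operatorname{Var}\left[ \mathbf{1}_k^{(n)} \right] = 2 \frac{n+1}{n-1} \frac{1}{(k+1)(k+2)} \left( 1 - 2 \frac{n+1}{n-1} \frac{1}{(k+1)(k+2)} \right). \tag{18}
\]

Proof
\[
\operatorname{Var}\left[ \mathbf{1}_k^{(n)} \right] = \operatorname{E}\left[ \left( \mathbf{1}_k^{(n)} \right)^2 \right] - \left( \operatorname{E}\left[ \mathbf{1}_k^{(n)} \right] \right)^2 = \pi_{n,k} - \pi_{n,k}^2 = \pi_{n,k} (1 - \pi_{n,k}) = 2 \frac{n+1}{n-1} \frac{1}{(k+1)(k+2)} \left( 1 - 2 \frac{n+1}{n-1} \frac{1}{(k+1)(k+2)} \right).
\]

The following lemma is an obvious consequence of the definition of $\mathbf{1}_k^{(n)}$.

Lemma 6.2 For $k_1 \neq k_2$
\[
\operatorname{Cov}\left[ \mathbf{1}_{k_1}^{(n)}, \mathbf{1}_{k_2}^{(n)} \right] = -\pi_{n,k_1} \pi_{n,k_2} = \frac{(-4)(n+1)^2}{(n-1)^2 (k_1+1)(k_1+2)(k_2+1)(k_2+2)}. \tag{19}
\]

Lemma 6.3
\[
\operatorname{Var}\left[ \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right] = 4 \frac{n+1}{n(n-1)^2} \frac{(n-(k+1)) \left( n(3k^2+5k-4) - (k^3+k^2+2k+8) \right)}{(k+1)^2 (k+2)^2 (k+3)(k+4)}. \tag{20}
\]

Proof Obviously
\[
\operatorname{Var}\left[ \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right] = \operatorname{E}\left[ \left( \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right)^2 \right] - \left( \operatorname{E}\left[ \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right] \right)^2.
\]
We notice (as Bartoszek and Sagitov, 2015b; Bartoszek, 2016, in Lemmata 11 and 2 respectively) that we may write
\[
\operatorname{E}\left[ \left( \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right)^2 \right] = \operatorname{E}\left[ \mathbf{1}_{k,1}^{(n)} \mathbf{1}_{k,2}^{(n)} \right],
\]
where $\mathbf{1}_{k,1}^{(n)}$, $\mathbf{1}_{k,2}^{(n)}$ are two independent copies of $\mathbf{1}_k^{(n)}$, i.e. we sample a pair of tips twice and ask if both pairs coalesced at the $k$-th speciation event. There are three possibilities: we (i) drew the same pair, (ii) drew two pairs sharing a single node or (iii) drew two disjoint pairs. Event (i) occurs with probability $\binom{n}{2}^{-1}$, (ii) with probability $2(n-2)\binom{n}{2}^{-1}$ and (iii) with probability $\binom{n-2}{2}\binom{n}{2}^{-1}$. As a check notice that $1 + 2(n-2) + \binom{n-2}{2} = \binom{n}{2}$. In case (i) $\mathbf{1}_{k,1}^{(n)} = \mathbf{1}_{k,2}^{(n)}$, hence writing informally
\[
\operatorname{E}\left[ \mathbf{1}_{k,1}^{(n)} \mathbf{1}_{k,2}^{(n)} \mid (i) \right] = \operatorname{E}\left[ \mathbf{1}_k^{(n)} \right] = \pi_{n,k}.
\]
To calculate cases (ii) and (iii) we visualize the situation in Fig. 4 and recall the proof of Bartoszek and Sagitov (2015b)'s Lemma 1.
Using Mathematica we obtain
\[
\begin{aligned}
\operatorname{E}\left[ \mathbf{1}_{k,1}^{(n)} \mathbf{1}_{k,2}^{(n)} \mid (ii) \right] &= \sum_{j=k+1}^{n-1} \left( 1 - \frac{3}{\binom{n}{2}} \right) \cdots \left( 1 - \frac{3}{\binom{j+2}{2}} \right) \frac{1}{\binom{j+1}{2}} \left( 1 - \frac{1}{\binom{j}{2}} \right) \cdots \left( 1 - \frac{1}{\binom{k+2}{2}} \right) \frac{1}{\binom{k+1}{2}} \\
&= 4 \frac{(n+1)}{(n-1)(n-2)} \frac{n-(k+1)}{(1+k)(2+k)(3+k)}.
\end{aligned}
\]

Figure 4: The three possible cases when drawing two random pairs of tip species that coalesce at the $k$-th speciation event. In the picture we "randomly draw" pairs (A, B) and (C, D).

Similarly for case (iii)
\[
\begin{aligned}
\operatorname{E}\left[ \mathbf{1}_{k,1}^{(n)} \mathbf{1}_{k,2}^{(n)} \mid (iii) \right] &= \sum_{j_2=k+2}^{n-1} \sum_{j_1=k+1}^{j_2-1} \left( 1 - \frac{6}{\binom{n}{2}} \right) \cdots \left( 1 - \frac{6}{\binom{j_2+2}{2}} \right) \frac{4}{\binom{j_2+1}{2}} \\
&\quad \cdot \left( 1 - \frac{3}{\binom{j_2}{2}} \right) \cdots \left( 1 - \frac{3}{\binom{j_1+2}{2}} \right) \frac{1}{\binom{j_1+1}{2}} \left( 1 - \frac{1}{\binom{j_1}{2}} \right) \cdots \left( 1 - \frac{1}{\binom{k+2}{2}} \right) \frac{1}{\binom{k+1}{2}} \\
&= 16 \frac{(n+1)}{(n-1)(n-2)(n-3)} \frac{(n-(k+1))(n-(k+2))}{(k+1)(k+2)(k+3)(k+4)}.
\end{aligned}
\]
We now put this together as
\[
\operatorname{Var}\left[ \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right] = \binom{n}{2}^{-1} \pi_{n,k} + 2(n-2) \binom{n}{2}^{-1} \operatorname{E}\left[ \mathbf{1}_{k,1}^{(n)} \mathbf{1}_{k,2}^{(n)} \mid (ii) \right] + \binom{n-2}{2} \binom{n}{2}^{-1} \operatorname{E}\left[ \mathbf{1}_{k,1}^{(n)} \mathbf{1}_{k,2}^{(n)} \mid (iii) \right] - \pi_{n,k}^2
\]
and we obtain (through Mathematica)
\[
\operatorname{Var}\left[ \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right] = 4 \frac{n+1}{n(n-1)^2} \frac{(n-(k+1)) \left( n(3k^2+5k-4) - (k^3+k^2+2k+8) \right)}{(k+1)^2 (k+2)^2 (k+3)(k+4)} \to 4 \frac{3k^2+5k-4}{(k+1)^2 (k+2)^2 (k+3)(k+4)}.
\]

Lemma 6.4 For $k_1 < k_2$
\[
\operatorname{Cov}\left[ \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mid \mathcal{Y}_n \right], \operatorname{E}\left[ \mathbf{1}_{k_2}^{(n)} \mid \mathcal{Y}_n \right] \right] = \frac{(-8)(n+1)}{n(n-1)^2} \frac{(3n-(k_2-2))(n-(k_2+1))}{(k_1+1)(k_1+2)(k_2+1)(k_2+2)(k_2+3)(k_2+4)}. \tag{21}
\]

Proof Obviously
\[
\operatorname{Cov}\left[ \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mid \mathcal{Y}_n \right], \operatorname{E}\left[ \mathbf{1}_{k_2}^{(n)} \mid \mathcal{Y}_n \right] \right] = \operatorname{E}\left[ \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mid \mathcal{Y}_n \right] \operatorname{E}\left[ \mathbf{1}_{k_2}^{(n)} \mid \mathcal{Y}_n \right] \right] - \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \right] \operatorname{E}\left[ \mathbf{1}_{k_2}^{(n)} \right].
\]
We notice that
\[
\operatorname{E}\left[ \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mid \mathcal{Y}_n \right] \operatorname{E}\left[ \mathbf{1}_{k_2}^{(n)} \mid \mathcal{Y}_n \right] \right] = \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mathbf{1}_{k_2}^{(n)} \right],
\]
where $\mathbf{1}_{k_1}^{(n)}$, $\mathbf{1}_{k_2}^{(n)}$ are the indicator variables of whether two independently sampled pairs coalesced at speciation events $k_1 < k_2$ respectively. There are now two possibilities represented in Fig. 5 (notice that since $k_1 \neq k_2$ the counterpart of event (i) in Fig. 4 cannot take place). Event (ii) occurs with probability $4/(n+1)$ and (iii) with probability $(n-3)/(n+1)$. Event (iii) can be divided into three "subevents".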
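Again we recall the proof of Bartoszek and Sagitov (2015b)'s Lemma 1 and we write informally for (ii), using Mathematica,
\[
\begin{aligned}
\operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mathbf{1}_{k_2}^{(n)} \mid (ii) \right] &= \left( 1 - \frac{3}{\binom{n}{2}} \right) \cdots \left( 1 - \frac{3}{\binom{k_2+2}{2}} \right) \frac{1}{\binom{k_2+1}{2}} \left( 1 - \frac{1}{\binom{k_2}{2}} \right) \cdots \left( 1 - \frac{1}{\binom{k_1+2}{2}} \right) \frac{1}{\binom{k_1+1}{2}} \\
&= 4 \frac{(n+1)(n+2)}{(n-1)(n-2)} \frac{1}{(k_1+1)(k_1+2)(k_2+2)(k_2+3)}.
\end{aligned}
\]

Figure 5: The possible cases when drawing two random pairs of tip species that coalesce at speciation events $k_1 < k_2$ respectively. In the picture we "randomly draw" pairs (A, B) and (C, D).

In the same way for the subcases of (iii)
\[
\begin{aligned}
\operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mathbf{1}_{k_2}^{(n)} \mid (iii) \right] &= \left( 1 - \frac{6}{\binom{n}{2}} \right) \cdots \left( 1 - \frac{6}{\binom{k_2+2}{2}} \right) \frac{1}{\binom{k_2+1}{2}} \left( 1 - \frac{3}{\binom{k_2}{2}} \right) \cdots \left( 1 - \frac{3}{\binom{k_1+2}{2}} \right) \frac{1}{\binom{k_1+1}{2}} \\
&\quad + \sum_{j=k_1+1}^{k_2-1} \left( 1 - \frac{6}{\binom{n}{2}} \right) \cdots \left( 1 - \frac{6}{\binom{k_2+2}{2}} \right) \frac{1}{\binom{k_2+1}{2}} \left( 1 - \frac{3}{\binom{k_2}{2}} \right) \cdots \left( 1 - \frac{3}{\binom{j+2}{2}} \right) \frac{2}{\binom{j+1}{2}} \\
&\qquad \cdot \left( 1 - \frac{1}{\binom{j}{2}} \right) \cdots \left( 1 - \frac{1}{\binom{k_1+2}{2}} \right) \frac{1}{\binom{k_1+1}{2}} \\
&\quad + \sum_{j=k_2+1}^{n-1} \left( 1 - \frac{6}{\binom{n}{2}} \right) \cdots \left( 1 - \frac{6}{\binom{j+2}{2}} \right) \frac{4}{\binom{j+1}{2}} \left( 1 - \frac{3}{\binom{j}{2}} \right) \cdots \left( 1 - \frac{3}{\binom{k_2+2}{2}} \right) \frac{1}{\binom{k_2+1}{2}} \\
&\qquad \cdot \left( 1 - \frac{1}{\binom{k_2}{2}} \right) \cdots \left( 1 - \frac{1}{\binom{k_1+2}{2}} \right) \frac{1}{\binom{k_1+1}{2}} \\
&= 4 \frac{(n+2)(n+1)}{(n-1)(n-2)(n-3)} \frac{n(k_2+6) - 5k_2 - 14}{(k_1+1)(k_1+2)(k_2+2)(k_2+3)(k_2+4)}.
\end{aligned}
\]
We now put this together as
\[
\operatorname{Cov}\left[ \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mid \mathcal{Y}_n \right], \operatorname{E}\left[ \mathbf{1}_{k_2}^{(n)} \mid \mathcal{Y}_n \right] \right] = 2(n-2) \binom{n}{2}^{-1} \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mathbf{1}_{k_2}^{(n)} \mid (ii) \right] + \binom{n-2}{2} \binom{n}{2}^{-1} \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mathbf{1}_{k_2}^{(n)} \mid (iii) \right] - \pi_{n,k_1} \pi_{n,k_2}
\]
and we obtain
\[
\operatorname{Cov}\left[ \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mid \mathcal{Y}_n \right], \operatorname{E}\left[ \mathbf{1}_{k_2}^{(n)} \mid \mathcal{Y}_n \right] \right] = \frac{(-8)(n+1)}{n(n-1)^2} \frac{(3n-(k_2-2))(n-(k_2+1))}{(k_1+1)(k_1+2)(k_2+1)(k_2+2)(k_2+3)(k_2+4)} \to \frac{(-24)}{(k_1+1)(k_1+2)(k_2+1)(k_2+2)(k_2+3)(k_2+4)}.
\]

Theorem 6.5
\[
\operatorname{E}\left[ V_i^{(n)} \right] = 2 \frac{1}{n-1} \frac{n-i}{i(i+1)}. \tag{22}
\]

Proof We immediately have
\[
\operatorname{E}\left[ V_i^{(n)} \right] = \operatorname{E}\left[ \frac{1}{i} \sum_{k=i}^{n-1} \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right] = 2 \frac{n+1}{n-1} \frac{1}{i} \sum_{k=i}^{n-1} \frac{1}{(k+1)(k+2)} = 2 \frac{1}{n-1} \frac{n-i}{i(i+1)} \to \frac{2}{i(i+1)}.
\]

Theorem 6.6
\[
\operatorname{Var}\left[ V_i^{(n)} \right] = 4 \frac{(n+1)}{n(n-1)^2} \frac{(n-i)(n-(i+1))(i-1)}{i^2 (i+1)^2 (i+2)(i+3)}. \tag{23}
\]

Proof We immediately may write using Lemmata 6.3, 6.4 and Mathematica
\[
\begin{aligned}
\operatorname{Var}\left[ V_i^{(n)} \right] &= \frac{1}{i^2} \left( \sum_{k=i}^{n-1} \operatorname{Var}\left[ \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right] + 2 \sum_{i \le k_1 < k_2 \le n-1} \operatorname{Cov}\left[ \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mid \mathcal{Y}_n \right], \operatorname{E}\left[ \mathbf{1}_{k_2}^{(n)} \mid \mathcal{Y}_n \right] \right] \right) \\
&= \frac{4}{i^2} \frac{n+1}{n(n-1)^2} \sum_{k=i}^{n-1} \frac{(n-(k+1)) \left( n(3k^2+5k-4) - (k^3+k^2+2k+8) \right)}{(k+1)^2 (k+2)^2 (k+3)(k+4)} \\
&\quad - \frac{16}{i^2} \frac{(n+1)}{n(n-1)^2} \sum_{i \le k_1 < k_2 \le n-1} \frac{(3n-(k_2-2))(n-(k_2+1))}{(k_1+1)(k_1+2)(k_2+1)(k_2+2)(k_2+3)(k_2+4)} \\
&= 4 \frac{(n+1)}{n(n-1)^2} \frac{(n-i)(n-(i+1))(i-1)}{i^2 (i+1)^2 (i+2)(i+3)} \to 4 \frac{(i-1)}{i^2 (i+1)^2 (i+2)(i+3)}.
\end{aligned}
\]

Theorem 6.7 For $1 \le i_1 < i_2 \le n-1$ we have
\[
\operatorname{Cov}\left[ V_{i_1}^{(n)}, V_{i_2}^{(n)} \right] = 4 \frac{(n+1)}{n(n-1)^2} \frac{(i_1-1)(n-i_2)(n-(i_2+1))}{i_1 (i_1+1) i_2 (i_2+1)(i_2+2)(i_2+3)}. \tag{24}
\]

Proof Again using Lemmata 6.3, 6.4, Mathematica and the fact that $i_1 < i_2$
\[
\begin{aligned}
\operatorname{Cov}\left[ V_{i_1}^{(n)}, V_{i_2}^{(n)} \right] &= \frac{1}{i_1 i_2} \operatorname{Cov}\left[ \sum_{k=i_1}^{n-1} \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right], \sum_{k=i_2}^{n-1} \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right] \\
&= \frac{1}{i_1 i_2} \left( \operatorname{Var}\left[ \sum_{k=i_2}^{n-1} \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right] + \operatorname{Cov}\left[ \sum_{k=i_1}^{i_2-1} \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right], \sum_{k=i_2}^{n-1} \operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right] \right] \right) \\
&= \frac{i_2}{i_1} \operatorname{Var}\left[ V_{i_2}^{(n)} \right] + \frac{1}{i_1 i_2} \sum_{k_1=i_1}^{i_2-1} \sum_{k_2=i_2}^{n-1} \operatorname{Cov}\left[ \operatorname{E}\left[ \mathbf{1}_{k_1}^{(n)} \mid \mathcal{Y}_n \right], \operatorname{E}\left[ \mathbf{1}_{k_2}^{(n)} \mid \mathcal{Y}_n \right] \right] \\
&= 4 \frac{(n+1)}{n(n-1)^2} \frac{(i_1-1)(n-i_2)(n-(i_2+1))}{i_1 (i_1+1) i_2 (i_2+1)(i_2+2)(i_2+3)} \to 4 \frac{i_1-1}{i_1 (i_1+1) i_2 (i_2+1)(i_2+2)(i_2+3)}.
\end{aligned}
\]

Theorem 6.8
\[
\begin{aligned}
\operatorname{Var}\left[ \sum_{i=1}^{n-1} V_i^{(n)} \right] &= \frac{1}{54 n^2 (n-1)^2} \left( 179n^4 + 588n^3 + 133n^2 - 432n - 468 - 108 n^2 (n+1)(n+3) H_{n-1,2} - 144 n H_{n-1,1} \right) \\
&\to \frac{179}{54} - \frac{\pi^2}{3} = \frac{179 - 18\pi^2}{54} \approx \frac{1.347}{54} \approx 0.025, \\
\operatorname{Var}\left[ \sum_{i=1}^{n-1} V_i^{(n)} Z_i \right] &= \frac{1}{9 n^2 (n-1)^2} \left( 12 n^2 (n^2 - 6n - 4) H_{n-1,2} - 9n^4 + 102n^3 + 51n^2 - 24 n H_{n-1,1} - 72n - 72 \right) \\
&\to \frac{2}{9} \pi^2 - 1 \approx 1.193, \\
\operatorname{Var}\left[ \sum_{i=1}^{n-1} \operatorname{E}\left[ V_i^{(n)} \right] Z_i \right] &= \frac{2}{3 n^2 (n-1)^2} \left( (12 H_{n-1,2} - 18) n^4 - 12n^3 + 6 n^2 (2n+1) H_{n-1,2} + 12n^2 + 12n + 6 \right) \\
&\to \frac{4}{3} \pi^2 - 12 \approx 1.159, \\
\operatorname{Var}\left[ \sum_{i=1}^{n-1} \left( V_i^{(n)} - \operatorname{E}\left[ V_i^{(n)} \right] \right) Z_i \right] &= \frac{1}{9 n^2 (n-1)^2} \left( 99n^4 + 174n^3 - 21n^2 - 144n - 108 - 12 n^2 (n+1)(5n+7) H_{n-1,2} - 24 n H_{n-1,1} \right) \\
&\to 11 - \frac{10}{9} \pi^2 \approx 0.034.
\end{aligned} \tag{25}
\]

Proof We use Mathematica to first calculate
\[
\begin{aligned}
\operatorname{Var}\left[ \sum_{i=1}^{n-1} V_i^{(n)} \right] &= \sum_{i=1}^{n-1} \operatorname{Var}\left[ V_i^{(n)} \right] + 2 \sum_{1 \le i_1 < i_2 \le n-1} \operatorname{Cov}\left[ V_{i_1}^{(n)}, V_{i_2}^{(n)} \right] \\
&= \frac{1}{54 n^2 (n-1)^2} \left( 179n^4 - 108 n^2 (n+1)(n+3) H_{n-1,2} + 588n^3 + 133n^2 - 144 n H_{n-1,1} - 432n - 468 \right) \\
&\to \frac{179}{54} - \frac{\pi^2}{3} \approx 0.025.
\end{aligned}
\]
For the second we again use Mathematica and the fact that the $Z_i$s are i.i.d. $\exp(1)$.
\[
\begin{aligned}
\operatorname{Var}\left[ \sum_{i=1}^{n-1} \operatorname{E}\left[ V_i^{(n)} \right] Z_i \right] &= \sum_{i=1}^{n-1} \left( 2 \frac{1}{n-1} \frac{n-i}{i(i+1)} \right)^2 \\
&= \frac{2 (12 H_{n-1,2} - 18) n^4 + 2 \left( 6 n^2 (2n+1) H_{n-1,2} - 12n^3 + 12n^2 + 12n + 6 \right)}{3 n^2 (n-1)^2} \to \frac{4}{3} \pi^2 - 12 \approx 1.159.
\end{aligned}
\]
For the third equality we use Mathematica and the fact that for independent families $\{X\}$ and $\{Y\}$ of random variables we have
\[
\begin{aligned}
\operatorname{Var}[XY] &= \operatorname{E}\left[ Y^2 \right] \operatorname{Var}[X] + (\operatorname{E}[X])^2 \operatorname{Var}[Y], \\
\operatorname{Cov}[X_1 Y_1, X_2 Y_2] &= \operatorname{E}[Y_1] \operatorname{E}[Y_2] \operatorname{Cov}[X_1, X_2] + \operatorname{E}[X_1] \operatorname{E}[X_2] \operatorname{Cov}[Y_1, Y_2].
\end{aligned}
\]
As the $Z_i$s are i.i.d. $\exp(1)$ we use Mathematica to obtain
\[
\begin{aligned}
\operatorname{Var}\left[ \sum_{i=1}^{n-1} V_i^{(n)} Z_i \right] &= \sum_{i=1}^{n-1} \operatorname{Var}\left[ V_i^{(n)} Z_i \right] + 2 \sum_{1 \le i_1 < i_2 \le n-1} \operatorname{Cov}\left[ V_{i_1}^{(n)} Z_{i_1}, V_{i_2}^{(n)} Z_{i_2} \right] \\
&= \sum_{i=1}^{n-1} \left( 2 \operatorname{Var}\left[ V_i^{(n)} \right] + \left( \operatorname{E}\left[ V_i^{(n)} \right] \right)^2 \right) + 2 \sum_{1 \le i_1 < i_2 \le n-1} \operatorname{Cov}\left[ V_{i_1}^{(n)}, V_{i_2}^{(n)} \right] \\
&= \frac{1}{9 n^2 (n-1)^2} \left( 12 n^2 (n^2 - 6n - 4) H_{n-1,2} - 9n^4 + 102n^3 + 51n^2 - 24 n H_{n-1,1} - 72n - 72 \right) \\
&\to \frac{1}{9} \left( 2\pi^2 - 9 \right) \approx 1.193.
\end{aligned}
\]
For the fourth equality we use the same properties and the pair-wise independence of the $Z_i$s.
\[
\begin{aligned}
\operatorname{Var}\left[ \sum_{i=1}^{n-1} \left( V_i^{(n)} - \operatorname{E}\left[ V_i^{(n)} \right] \right) Z_i \right] &= \operatorname{Var}\left[ \sum_{i=1}^{n-1} V_i^{(n)} Z_i \right] + \operatorname{Var}\left[ \sum_{i=1}^{n-1} \operatorname{E}\left[ V_i^{(n)} \right] Z_i \right] - 2 \sum_{i_1, i_2 = 1}^{n-1} \operatorname{Cov}\left[ V_{i_1}^{(n)} Z_{i_1}, \operatorname{E}\left[ V_{i_2}^{(n)} \right] Z_{i_2} \right] \\
&= \operatorname{Var}\left[ \sum_{i=1}^{n-1} V_i^{(n)} Z_i \right] + \sum_{i=1}^{n-1} \left( \operatorname{E}\left[ V_i^{(n)} \right] \right)^2 - 2 \sum_{i=1}^{n-1} \left( \operatorname{E}\left[ V_i^{(n)} \right] \right)^2 \\
&= \operatorname{Var}\left[ \sum_{i=1}^{n-1} V_i^{(n)} Z_i \right] - \sum_{i=1}^{n-1} \left( \operatorname{E}\left[ V_i^{(n)} \right] \right)^2 \\
&= \frac{1}{9 n^2 (n-1)^2} \left( 99n^4 - 12 n^2 (n+1)(5n+7) H_{n-1,2} + 174n^3 - 21n^2 - 24 n H_{n-1,1} - 144n - 108 \right) \\
&\to 11 - \frac{10}{9} \pi^2 \approx 0.034.
\end{aligned}
\]

It is worth noting that the above Lemmata and Theorems were confirmed by numerical evaluations of the formulae and comparing these to simulations performed to obtain Fig. 1. As a check also notice that, as implied by variance properties,
\[
\operatorname{Var}\left[ \sum_{i=1}^{n-1} \operatorname{E}\left[ V_i^{(n)} \right] Z_i \right] + \operatorname{Var}\left[ \sum_{i=1}^{n-1} \left( V_i^{(n)} - \operatorname{E}\left[ V_i^{(n)} \right] \right) Z_i \right] \to \frac{4}{3} \pi^2 - 12 + 11 - \frac{10}{9} \pi^2 = \frac{2}{9} \pi^2 - 1 \leftarrow \operatorname{Var}\left[ \sum_{i=1}^{n-1} V_i^{(n)} Z_i \right].
\]

Theorem 6.9
\[
\operatorname{E}\left[ \operatorname{Var}\left[ \frac{1}{n} \sum_{i=2}^{n-2} \frac{V_i^{(n)} Z_i - \operatorname{E}\left[ V_i^{(n)} \right]}{\sqrt{\operatorname{Var}\left[ V_i^{(n)} \right]}} \,\Big|\, \left\{ V_i^{(n)} \right\} \right] \right] \to 0.5. \tag{26}
\]

Proof Using the limit for the variance of $V_i^{(n)}$ (Thm. 6.6) and the independence of the $Z_i$s we have
\[
\operatorname{E}\left[ \operatorname{Var}\left[ \frac{1}{n} \sum_{i=2}^{n-2} \frac{V_i^{(n)} Z_i - \operatorname{E}\left[ V_i^{(n)} \right]}{\sqrt{\operatorname{Var}\left[ V_i^{(n)} \right]}} \,\Big|\, \left\{ V_i^{(n)} \right\} \right] \right] \sim \sum_{i=2}^{n-2} \frac{i^2 (i+1)^2 (i+2)(i+3)}{4 n^2 (i-1)} \operatorname{E}\left[ \left( V_i^{(n)} \right)^2 \right].
\]
Now from Thms. 6.6 and 6.5 we have
\[
\begin{aligned}
\operatorname{E}\left[ \left( V_i^{(n)} \right)^2 \right] &= 4 \frac{n+1}{n(n-1)^2} \frac{(n-i)(n-(i+1))(i-1)}{i^2 (i+1)^2 (i+2)(i+3)} + \left( 2 \frac{1}{n-1} \frac{n-i}{i(i+1)} \right)^2 \\
&= 4 \frac{(n-i)^2}{(n-1)^2} \frac{1}{i^2 (i+1)^2} \left( \frac{(n+1)(n-(i+1))(i-1)}{n(n-i)(i+2)(i+3)} + 1 \right) \to 4 \frac{i+5}{i^2 (i+1)(i+2)(i+3)}.
\end{aligned}
\]
Plugging this in (and using Mathematica)
\[
\begin{aligned}
\sum_{i=2}^{n-2} \frac{i^2 (i+1)^2 (i+2)(i+3)}{4 n^2 (i-1)} \operatorname{E}\left[ \left( V_i^{(n)} \right)^2 \right] &\sim \sum_{i=2}^{n-2} \frac{i^2 (i+1)^2 (i+2)(i+3) \, 4(i+5)}{4 n^2 (i-1) i^2 (i+1)(i+2)(i+3)} \\
&= n^{-2} \sum_{i=2}^{n-2} \frac{(i+1)(i+5)}{(i-1)} = n^{-2} \frac{1}{2} \left( n^2 + 11n + 24 H_{n-3,1} - 42 \right) \to 0.5.
\end{aligned}
\]

Remark 6.10 Simulations presented in Fig. 6 and Thm. 6.9 suggest a different possible CLT, namely
\[
\frac{1}{n} \sum_{i=2}^{n-2} \frac{V_i^{(n)} Z_i - \operatorname{E}\left[ V_i^{(n)} \right]}{\sqrt{\operatorname{Var}\left[ V_i^{(n)} \right]}} \xrightarrow{\text{weakly}} \text{some distribution}(\text{mean} = 0, \text{variance} = 1/2). \tag{27}
\]
We sum over $i = 2, \ldots, n-2$ as $V_1^{(n)} = 1$ and $V_{n-1}^{(n)} = \frac{1}{n-1} \binom{n}{2}^{-1}$ for all $n$. It would be tempting to take the distribution to be a normal one. However, we should be wary after Rem. 2.9 and Fig. 1 that for our rather delicate problem even very fine simulations can indicate incorrect weak limits. It remains to study the variance of the conditional variance in Eq. (26). It is not entirely clear if this variance of the conditional variance will converge to 0. Hence, it remains an open problem to investigate the conjecture of Eq. (27).

Figure 6: Density estimates of scaled and centred cophenetic indices for 10000 simulated 500 tip Yule trees with $\lambda = 1$. Left: density estimate of $\left( \Phi^{(n)} - \operatorname{E}\left[ \Phi^{(n)} \right] \right) / \sqrt{\operatorname{Var}\left[ \Phi^{(n)} \right]}$. The black curve is the density fitted to simulated data by R's density() function, the gray is the $N(0, 1)$ density. Right: simulation of Eq. (27), the gray curve is the $N(0, 1/2)$ density, and the black curve is the density fitted to simulated data by R's density() function. The sample variance of the simulated Eq. (27) values is 0.385, indicating that with $n = 500$ we still have a high variability or alternatively that the variance of the sample variance in Eq. (26) does not converge to 0.

7 Alternative descriptions

7.1 Difference process

Let us consider in detail the families of random variables $V_i^{(n)}$ and $\operatorname{E}\left[ \mathbf{1}_k^{(n)} \mid \mathcal{Y}_n \right]$. Obviously $V_i^{(n)}$ is $i^{-1} \binom{n}{2}^{-1}$ times the number of pairs that coalesced after the $(i-1)$-st speciation event for a given Yule tree. Denote
\[
A_i^{(n)} := i V_i^{(n)}.
\]
As going from $n$ to $n+1$ means a new speciation event and a coalescent at this new $n$-th event, then
\[
\binom{n+1}{2} A_i^{(n+1)} \ge \binom{n}{2} A_i^{(n)} + 1.
\]
We also know by previous calculations that
\[
\operatorname{E}\left[ A_i^{(n)} \right] = i \operatorname{E}\left[ V_i^{(n)} \right] = 2(n-i)/((n-1)(i+1)) \to 2/(i+1).
\]
Let $\binom{n+1}{2} \epsilon_i^{(n)}$ denote the number of newly introduced coalescent events after the $(i-1)$-st one when we go from $n$ to $n+1$ species. Obviously $\epsilon_i^{(n)} \ge \binom{n+1}{2}^{-1} > 0$. Then, we may write
\[
\binom{n+1}{2} A_i^{(n+1)} = \binom{n}{2} A_i^{(n)} + \binom{n+1}{2} \epsilon_i^{(n)}.
\]
Now,
\[
\begin{aligned}
\operatorname{E}\left[ \epsilon_i^{(n)} \right] &= \operatorname{E}\left[ A_i^{(n+1)} \right] - \binom{n+1}{2}^{-1} \binom{n}{2} \operatorname{E}\left[ A_i^{(n)} \right] = \frac{2(n+1-i)}{n(i+1)} - \frac{n(n-1)}{n(n+1)} \frac{2(n-i)}{(n-1)(i+1)} \\
&= \frac{2}{i+1} \left( \frac{n+1-i}{n} - \frac{n-i}{n+1} \right) = \frac{2}{i+1} \frac{(n-i+1)(n+1) - n(n-i)}{n(n+1)} = \frac{2}{i+1} \frac{n(n-i) + n + n - i + 1 - n(n-i)}{n(n+1)} \\
&= \frac{2}{i+1} \frac{2n+1-i}{n(n+1)} \to 0.
\end{aligned}
\]
Therefore, for every $i$, $\epsilon_i^{(n)} \to 0$ almost surely as it is a positive random variable whose expectation goes to 0. However, $A_i^{(n)}$ is bounded by 1, as it can be understood in terms of the conditional (on the tree) cumulative distribution function for the random variable $\kappa^{(n)}$, the speciation event at which a random pair of tips coalesced, i.e. for all $i = 1, \ldots, n-1$
\[
P\left( \kappa^{(n)} \le i-1 \mid \mathcal{Y}_n \right) = 1 - A_i^{(n)}.
\]
Therefore, as $A_i^{(n)}$ is bounded by 1 and the difference process
\[
A_i^{(n)} - A_i^{(n-1)} = \epsilon_i^{(n-1)} - \frac{2}{n} A_i^{(n-1)}
\]
goes almost surely to 0 we may conclude that $A_i^{(n)}$ converges almost surely to some random variable $A_i$. In particular, this implies the almost sure convergence of $V_i^{(n)}$ to a limiting random variable $V_i$.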
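For small $n$ the closed forms for $\operatorname{E}[A_i^{(n)}]$ and $\operatorname{Var}[V_i^{(n)}]$ (Eqs. (22) and (23)) can be checked exactly. The sketch below is an illustration and not the Mathematica derivation of Section 6. It assumes that all $(n-1)!$ Yule histories, obtained by letting a uniformly chosen tip split at each speciation event, are equally likely, enumerates them, and uses exact rational arithmetic. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def pair_counts(choices):
    """choices[k-1] in range(k) is the tip that splits at speciation event k."""
    n = len(choices) + 1
    child = {k: [None, None] for k in range(1, n)}   # None marks a tip
    tips = [(0, 0)]                                  # (parent event, side); 0 = origin
    for k, idx in enumerate(choices, start=1):
        parent, side = tips[idx]
        if parent:
            child[parent][side] = k
        tips[idx] = (k, 0)
        tips.append((k, 1))
    size, counts = {}, {}
    for k in range(n - 1, 0, -1):                    # children have larger event numbers
        left, right = (size[c] if c else 1 for c in child[k])
        size[k] = left + right
        counts[k] = left * right                     # pairs coalescing at event k
    return counts


def exact_moments_of_a(n, i):
    """Exact mean and variance of A_i^(n) = i V_i^(n) over all Yule histories."""
    values = []
    for choices in product(*(range(k) for k in range(1, n))):
        counts = pair_counts(choices)
        values.append(Fraction(sum(counts[k] for k in range(i, n)), comb(n, 2)))
    mean = sum(values) / len(values)
    variance = sum(v * v for v in values) / len(values) - mean ** 2
    return mean, variance


def mean_a_formula(n, i):
    return Fraction(2 * (n - i), (n - 1) * (i + 1))


def var_v_formula(n, i):
    return Fraction(4 * (n + 1) * (n - i) * (n - i - 1) * (i - 1),
                    n * (n - 1) ** 2 * i ** 2 * (i + 1) ** 2 * (i + 2) * (i + 3))
```

Since $A_i^{(n)} = iV_i^{(n)}$, the enumerated variance should equal $i^2$ times the right-hand side of Eq. (23), and every enumerated value of $A_i^{(n)}$ lies in $[0, 1]$ by construction.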
Furthermore, as $\operatorname{E}\left[ \sum_{i=1}^{n-1} V_i^{(n)} \right]$ and $\operatorname{Var}\left[ \sum_{i=1}^{n-1} V_i^{(n)} \right]$ are both $O(1)$ we may conclude that $\sum_{i=1}^{n-1} V_i^{(n)}$ also converges almost surely. This means that the discrete version (all $T_i = 1$, corresponding to $\tilde{\Phi}^{(n)}$) of the cophenetic index converges almost surely (compare with Thm. 2.11).

7.2 Pólya urn description

The cophenetic index both in the discrete and continuous version has the following Pólya urn description. We start with an urn filled with $n$ balls. Each ball has a number painted on it, 0 initially. At each step we remove a pair of balls, say with numbers $x$ and $y$, and return a ball with the number $(x+1)(y+1)$ painted on it. We stop when there is only one ball, it will have value [...]. Denote $B_{k,i,n}$ as the value painted on the $k$-th ball in the $i$-th step when we initially started with $n$ balls. Then we can represent the cophenetic index as
\[
\Phi^{(n)} = \sum_{i=1}^{n-1} \sum_{k=1}^{i} B_{k,i,n} T_i \quad \text{and} \quad \tilde{\Phi}^{(n)} = \sum_{i=1}^{n-1} \sum_{k=1}^{i} B_{k,i,n}.
\]

Acknowledgments

I was supported by the Knut and Alice Wallenberg Foundation and am now by the Swedish Research Council (Vetenskapsrådet) grant no. 2017-04951. I am grateful to the Barcelona Graduate School of Mathematics (BGSMath) for sponsoring the Workshop on Algebraical and Combinatorial Phylogenetics which significantly contributed to the development of my work. I would like to thank the whole Computational Biology and Bioinformatics Research Group of the Balearic Islands University for hosting me on multiple occasions, many discussions and suggestions on phylogenetic indices. My visits to the Balearic Islands University were partially supported by the G S Magnuson Foundation of the Royal Swedish Academy of Sciences (grants no. MG2015-0055, MG2017-0066) and The Foundation for Scientific Research and Education in Mathematics (SVeFUM). I would like to acknowledge Gabriel Yedid for numerous discussions on the distribution of the cophenetic index and sharing his cophenetic index simulation R code.
I am grateful to Cecilia Holmgren and Svante Janson for pointing me to the works on contraction-type distributions and many discussions. I would furthermore like to acknowledge Wojciech Bartoszek, Sergey Bobkov, Joachim Domsta, Serik Sagitov, Mike Steel for helpful comments and discussions related to this work. I am indebted to two anonymous reviewers, an anonymous editor and Haochi Kiang for careful reading of an earlier version of the manuscript and comments significantly improving it.

References

P.-M. Agapow and A. Purvis. Power of eight tree shape statistics to detect nonrandom diversification: a comparison by simulation of two models of cladogenesis. Syst. Biol., 51(6):866-872, 2002.
K. Bartoszek. Quantifying the effects of anagenetic and cladogenetic evolution. Math. Biosci., 254:42-57, 2014.
K. Bartoszek. A central limit theorem for punctuated equilibrium. ArXiv e-prints, 2016.
K. Bartoszek and S. Sagitov. A consistent estimator of the evolutionary rate. J. Theor. Biol., 371:69-78, 2015a.
K. Bartoszek and S. Sagitov. Phylogenetic confidence intervals for the optimal trait value. J. App. Prob., 52:1115-1132, 2015b.
M. G. B. Blum and O. François. On statistical tests of phylogenetic tree imbalance: The Sackin and other indices revisited. Math. Biosci., 195:141-153, 2005.
M. G. B. Blum and O. François. Which random processes describe the Tree of Life? A large-scale study of phylogenetic tree imbalance. Syst. Biol., 55(4):685-691, 2007.
M. G. B. Blum, O. François, and S. Janson. The mean, variance and limiting distribution of two statistics sensitive to phylogenetic tree balance. Ann. Appl. Probab., 16(4):2195-2214, 2006.
G. Cardona, A. Mir, and F. Rosselló. Exact formulas for the variance of several balance indices under the Yule model. J. Math. Biol., 67:1833-1846, 2013.
D. H. Colless. Review of "Phylogenetics: the theory and practise of phylogenetic systematics". Syst. Zool., 31:100-104, 1982.
L. Devroye, J. A. Fill, and R. Neininger. Perfect simulation from the Quicksort limit distribution. Electronic Comm. Probab., 5(12):95-99, 2000.
A. J. Drummond and A. Rambaut. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol., 7:214, 2007.
W. Ewens and G. Grant. Statistical Methods in Bioinformatics: An Introduction. Springer, New York, 2005.
J. Felsenstein. Inferring Phylogenies. Sinauer Associates Inc., Sunderland, U.S.A., 2004.
J. A. Fill and S. Janson. Smoothness and decay properties of the limiting Quicksort density function. In D. Gardy and A. Mokkadem, editors, Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities, Trends in Mathematics, pages 53-64. Birkhäuser, Basel,
J. A. Fill and S. Janson. Approximating the limiting Quicksort distribution. Rand. Struct. Alg., 19(3-4):376-406, 2001.
T. Gernhard. The conditioned reconstructed process. J. Theor. Biol., 253:769-778, 2008.
G. Grimmett and D. Stirzaker. Probability and Random Processes (Third Edition). Oxford University Press, Oxford, 2009.
S. Janson. On the tails of the limiting Quicksort distribution. Electronic Comm. Probab., 81:1-7, 2015.
M. L. Kendall, M. Boyd, and C. Colijn. phyloTop, 2016. https://cran.r-project.org/web/packages/phyloTop/index.html.
A. McKenzie and M. Steel. Distributions of cherries for two models of trees. Math. Biosci., 164:81-92, 2000.
A. Mir, F. Rosselló, and L. Rotger. A new balance index for phylogenetic trees. Math. Biosci., 241(1):125-136, 2013.
E. Paradis, J. Claude, and K. Strimmer. APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20:289-290, 2004.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org.
M. Régnier. A limiting distribution for Quicksort. Theor. Inf. Applic., 23(3):335-343, 1989.
A. Rosalsky and M. Sreehari. On the limiting behavior of randomly weighted partial sums. Stat. & Prob. Lett., 40:403-410, 1998.
U. Rösler. A limit theorem for "Quicksort". Theor. Inf. Applic., 25(1):85-100, 1991.
U. Rösler. A fixed point theorem for distributions. Stoch. Proc. Applic., 42:195-214, 1992.
U. Rösler and L. Rüschendorf. The contraction method for recursive algorithms. Algorithmica, 29:3-33, 2001.
M. J. Sackin. "Good" and "bad" phenograms. Syst. Zool., 21:225-226, 1972.
S. Sagitov and K. Bartoszek. Interspecies correlation for neutrally evolving traits. J. Theor. Biol., 309:11-19, 2012.
V. Sosa, J. F. Ornelas, S. Ramírez-Barahona, and E. Gándara. Historical reconstruction of climatic and elevation preferences and the evolution of cloud forest-adapted tree ferns in Mesoamerica. PeerJ, 4:e2696, 2016a.
V. Sosa, J. F. Ornelas, S. Ramírez-Barahona, and E. Gándara. Data from: Historical reconstruction of climatic and elevation preferences and the evolution of cloud forest-adapted tree ferns in Mesoamerica. Dryad Digital Repository, 2016b. https://doi.org/10.5061/dryad.709t8.
T. Stadler. On incomplete sampling under birth-death models and connections to the sampling-based coalescent. J. Theor. Biol., 261(1):58-68, 2009.
T. Stadler. Simulating trees with a fixed number of extant species. Syst. Biol., 60(5):676-684, 2011.
T. Stadler and M. Steel. Distribution of branch lengths and phylogenetic diversity under homogeneous speciation models. J. Theor. Biol., 297:33-40, 2012.
M. Steel and A. McKenzie. Properties of phylogenetic trees generated by Yule-type speciation models. Math. Biosci., 170:91-112, 2001.
G.-D. Yang, P.-M. Agapow, and G. Yedid. The tree balance signature of mass extinction is erased by continued evolution in clades of constrained size with trait-dependent speciation. PLoS ONE, 12(6):e0179553, 2017.
Z. Yang. Computational Molecular Evolution. Oxford Series in Ecology and Evolution. Oxford University Press, Oxford, 2006.
K. Yusim, M. Peeters, O. G. Pybus, T. Bhattacharya, E. Delaporte, C. Mulanga, M. Muldoon, J. Theiler, and B.
Korber. Using human immunodeficiency virus type 1 sequences to infer historical features of the acquired immune deficiency syndrome epidemic and human immunodeficiency virus evolution. Philos. Trans. Roy. Soc. Lond. B, 356:855–866, 2001.

Appendix A: Mathematica code for Section 6

(* Mathematica code used to obtain the closed form formulae of Section 3.
   Second order properties in K. Bartoszek Exact and approximate limit
   behaviour of the Yule tree's cophenetic index.
   The script was run using Mathematica 9.0 for Linux x86 (64 bit) running on
   Ubuntu 12.04.5 LTS. It has to be noted that Mathematica's output should be
   manually postprocessed in order to have the formulae in terms of harmonic
   sums and not derivatives of polygamma functions. All the references in this
   script point to appropriate fragments of the manuscript.
   We choose the pairs in order, i.e. first the first pair to coalesce then
   the second pair to coalesce. *)

(* Compare with proof of Lemma 1 of Bartoszek and Sagitov (2015b) *)
FcoalProb[n_,k_,c_]=FullSimplify[Product[(1-c/((r(r-1))/2)),{r,k+2,n}]]

(* Def. 2.3, Eq. (1) *)
E1k[n_,k_]:=(2(n+1)/((n-1)(k+1)(k+2)))

(* Lemma 6.1, Eq. (18) *)
Var1k[n_,k_]:=(E1k[n,k]-E1k[n,k]*E1k[n,k])

(* Lemma 6.2, Eq. (19) *)
Cov1k11k2[n_,k1_,k2_]:=(-E1k[n,k1]*E1k[n,k2])

(* Lemma 6.3, Eq. (20) *)
VarE1k[n_,k_]=(FullSimplify[
  (1/(n(n-1)/2))*(2(n+1)/((n-1)(k+1)(k+2)))
  +(2(n-2)/(n(n-1)/2))*(Sum[FcoalProb[n,j,3]*(1/((j+1)j/2))*FcoalProb[j,k,1]*(1/((k+1)k/2)),{j,k+1,n-1}])
  +((n-2)(n-3)/2/(n(n-1)/2))*(Sum[Sum[FcoalProb[n,j1,6]*(4/((j1+1)j1/2))*FcoalProb[j1,j2,3]*(1/((j2+1)j2/2))*FcoalProb[j2,k,1]*(1/((k+1)k/2)),{j1,j2+1,n-1}],{j2,k+1,n-2}])
  -(E1k[n,k])*(E1k[n,k])])

(* Lemma 6.4, Eq. (21) *)
CovE1k1E1k2[n_,k1_,k2_]=(FullSimplify[
  (2(n-2)/(n(n-1)/2))*(FcoalProb[n,k2,3]*(1/((k2+1)k2/2))*FcoalProb[k2,k1,1]*(1/((k1+1)k1/2)))
  +((n-2)(n-3)/2/(n(n-1)/2))*(
    FcoalProb[n,k2,6]*(1/((k2+1)k2/2))*FcoalProb[k2,k1,3]*(1/((k1+1)k1/2))
    +Sum[FcoalProb[n,k2,6]*(1/((k2+1)k2/2))*FcoalProb[k2,j,3]*(2/((j+1)j/2))*FcoalProb[j,k1,1]*(1/((k1+1)k1/2)),{j,k1+1,k2-1}]
    +Sum[FcoalProb[n,j,6]*(4/((j+1)j/2))*FcoalProb[j,k2,3]*(1/((k2+1)k2/2))*FcoalProb[k2,k1,1]*(1/((k1+1)k1/2)),{j,k2+1,n-1}])
  -(E1k[n,k1])*(E1k[n,k2])])

(* Thm. 6.1, Eq. (22) *)
EVi[n_,i_]:=(FullSimplify[Sum[E1k[n,k],{k,i,n-1}]/i])

(* Thm. 6.2, Eq. (23) *)
VarVi[n_,i_]=(FullSimplify[(Sum[VarE1k[n,k],{k,i,n-1}]+2Sum[Sum[CovE1k1E1k2[n,k1,k2],{k2,k1+1,n-1}],{k1,i,n-1}])/(i*i)])

(* Thm. 6.3, Eq. (24) *)
CovVi1Vi2[n_,i1_,i2_]=(FullSimplify[(i2*i2*VarVi[n,i2]+Sum[Sum[CovE1k1E1k2[n,k1,k2],{k2,i2,n-1}],{k1,i1,i2-1}])/(i1*i2)])

(* Thm. 6.4, formula 1 *)
EVi2[n_,i_]=(FullSimplify[VarVi[n,i]+(EVi[n,i]^2)])
VarSumVi[n_]=(FullSimplify[Sum[EVi2[n,i],{i,1,n-1}]+2Sum[Sum[CovVi1Vi2[n,i1,i2],{i2,i1+1,n-1}],{i1,1,n-2}]])

(* Thm. 6.4 Eq. (13), formula 2 *)
VarWn[n_]=(FullSimplify[2 Sum[VarVi[n,i],{i,1,n-1}]+Sum[(EVi[n,i])^2,{i,1,n-1}]+2Sum[Sum[CovVi1Vi2[n,i1,i2],{i2,i1+1,n-1}],{i1,1,n-1}]])

(* Thm. 6.4 Eq. (13), formula 3 *)
VarWnBar[n_]=(FullSimplify[Sum[(EVi[n,i])^2,{i,1,n-1}]])

(* Thm. 6.4 Eq. (13), formula 4 *)
VarWnCentre[n_]=(FullSimplify[2 Sum[VarVi[n,i],{i,1,n-1}]+2Sum[Sum[CovVi1Vi2[n,i1,i2],{i2,i1+1,n-1}],{i1,1,n-1}]])

(* Thm. 6.5 *)
FinalPart[n_]=(Sum[(i+1)(i+5)/(i-1),{i,2,n-2}])

Appendix B: Counterparts of Rösler (1991)'s Prop.
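3.2 for the cophenetic index

Lemma 7.1 Define for $i \in \{1, \ldots, n\}$
\[
\tilde{C}_n(i) = n^{-2}\left( \operatorname{E}\left[\tilde{\Phi}^{(i)}\right] + \operatorname{E}\left[\tilde{\Phi}^{(n-i)}\right] - \operatorname{E}\left[\tilde{\Phi}^{(n)}\right] + \binom{n}{2} - i(n-i) \right)
\]
and $\tilde{C}(x) = 0.5 - 3x(1-x)$ for $x \in [0,1]$, then
\[
\sup_{x \in [0,1]} \left|\tilde{C}_n(\lceil nx \rceil) - \tilde{C}(x)\right| \le 2n^{-1}\ln n + O(n^{-1}).
\]

Proof Writing out
\begin{align*}
\tilde{C}_n(i) &= n^{-2}\Big( i^2 + i - 2iH_{i,1} + (n-i)^2 + (n-i) - 2(n-i)H_{n-i,1} - n^2 - n + 2nH_{n,1} + \binom{n}{2} - i(n-i) \Big) \\
&= n^{-2}\Big( 3i^2 - 3in + \tfrac{1}{2}n^2 + 2nH_{n,1} - \tfrac{1}{2}n - 2iH_{i,1} - 2(n-i)H_{n-i,1} \Big) \\
&< \frac{1}{2} - 3\,\frac{i}{n}\left(1 - \frac{i}{n}\right) + 2n^{-1}\ln n.
\end{align*}
Therefore, assuming that $1 \le \lceil nx \rceil \le n-1$,
\begin{align*}
\left|\tilde{C}_n(\lceil nx \rceil) - \tilde{C}(x)\right| &\le 3\left| \frac{\lceil nx \rceil}{n}\left(1 - \frac{\lceil nx \rceil}{n}\right) - x(1-x) \right| + 2n^{-1}\ln n \\
&\le \sup_{|y-z| < 1/n} \left|\tilde{C}(y) - \tilde{C}(z)\right| + 2n^{-1}\ln n \le \frac{6}{n} + 2n^{-1}\ln n + O(n^{-2}).
\end{align*}
If $\lceil nx \rceil = n$, we notice that $x \in (1 - 1/n, 1]$ and directly obtain
\[
\left|\tilde{C}_n(\lceil nx \rceil) - \tilde{C}(x)\right| \le 3|x(1-x)| + 2n^{-1}\ln n \le 2n^{-1}\ln n + \frac{3}{n}.
\]

Lemma 7.2 Define for $i \in \{1, \ldots, n\}$, $T, T' \sim \exp(2)$
\[
C_n(i, T, T') = n^{-2}\left( \operatorname{E}\left[\Phi^{(i)}_{NRE}\right] + \operatorname{E}\left[\Phi^{(n-i)}_{NRE}\right] - \operatorname{E}\left[\Phi^{(n)}_{NRE}\right] + \binom{i}{2}T + \binom{n-i}{2}T' \right)
\]
and for $x \in [0,1]$, $T, T' \sim \exp(2)$
\[
C(x, T, T') = \frac{1}{2}x^2 T + \frac{1}{2}(1-x)^2 T' - x(1-x),
\]
then
\[
\sup_{x \in [0,1]} \left|C_n(\lceil nx \rceil, T, T') - C(x, T, T')\right| \le n^{-1}\ln n + O(n^{-1}) + B_n,
\]
where $B_n$ is a positive random variable that converges to 0 almost surely with expectation decaying as $O(n^{-1})$ and second moment as $O(n^{-2})$.

Proof Similarly, as in the proof of Lemma 7.1 we write out
\begin{align*}
C_n(i, T, T') &= n^{-2}\Big( \binom{i}{2}T + \binom{n-i}{2}T' + \tfrac{1}{2}(i^2 + i) - iH_{i,1} + \tfrac{1}{2}\big((n-i)^2 + (n-i)\big) - (n-i)H_{n-i,1} - \tfrac{1}{2}(n^2 + n) + nH_{n,1} \Big) \\
&< \frac{1}{2}\left(\frac{i}{n}\right)^2 T + \frac{1}{2}\left(\frac{n-i}{n}\right)^2 T' - \frac{i}{n}\left(1 - \frac{i}{n}\right) + n^{-1}\ln n - \frac{1}{2}\left(\frac{i}{n^2}T + \frac{n-i}{n^2}T'\right).
\end{align*}
We denote $A_n = (1/2)\left(\frac{i}{n^2}T + \frac{n-i}{n^2}T'\right)$ and notice that it converges almost surely to 0 with $n$. Now, assuming that $1 \le \lceil nx \rceil \le n-1$,
\begin{align*}
\left|C_n(\lceil nx \rceil) - C(x)\right| &\le \frac{1}{2}\left| \left(\frac{\lceil nx \rceil}{n}\right)^2 - x^2 \right| T + \frac{1}{2}\left| \left(1 - \frac{\lceil nx \rceil}{n}\right)^2 - (1-x)^2 \right| T' \\
&\quad + \left| \frac{\lceil nx \rceil}{n}\left(1 - \frac{\lceil nx \rceil}{n}\right) - x(1-x) \right| + n^{-1}\ln n + A_n \\
&< \frac{1}{2}\sup_{|y-z| < 1/n} |y^2 - z^2|\, T + \frac{1}{2}\sup_{|y-z| < 1/n} |y^2 - z^2|\, T' + \sup_{|y-z| < 1/n} |-y(1-y) + z(1-z)| + n^{-1}\ln n + A_n \\
&\le (n^{-1} + O(n^{-2}))T + (n^{-1} + O(n^{-2}))T' + \frac{2}{n} + O(n^{-2}) + n^{-1}\ln n + A_n.
\end{align*}
If $\lceil nx \rceil = n$, we notice that $x \in (1 - 1/n, 1]$ and directly obtain
\[
\left|C_n(\lceil nx \rceil) - C(x)\right| \le \frac{1}{2}n^{-2}T + \frac{1}{2}n^{-2}T' + n^{-1} + n^{-1}\ln n + A_n.
\]
Therefore, if we now denote
\[
B_n = A_n + (n^{-1} + O(n^{-2}))T + (n^{-1} + O(n^{-2}))T'
\]
we obtain the statement of the Lemma.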
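The bound of Lemma 7.1 can be checked numerically. The sketch below is an illustration in Python, not part of the Mathematica script of Appendix A, and all function names in it are hypothetical. It assumes the expectation n(n+1) - 2nH_{n,1} of the discrete cophenetic index that is written out in the proof of Lemma 7.1, together with the pair-count term "n choose 2" minus i(n-i). It compares the largest gap between the finite-n term and its limit 0.5 - 3x(1-x) with the explicit part of the bound, 2 ln(n)/n + 6/n; the constant 6/n is an assumption about the proof's explicit constant, and the O(n^-2) remainder is dropped. Each cell ((i-1)/n, i/n] is sampled on a finite grid, so the computed value only approximates the supremum.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def harmonic(n: int) -> float:
    """H_{n,1} = 1 + 1/2 + ... + 1/n, with H_{0,1} = 0."""
    return math.fsum(1.0 / k for k in range(1, n + 1))


def expected_discrete_index(n: int) -> float:
    """n(n+1) - 2 n H_{n,1}, the expectation assumed from the proof of Lemma 7.1."""
    return n * (n + 1) - 2 * n * harmonic(n)


def c_tilde_n(n: int, i: int) -> float:
    """Finite-n term of Lemma 7.1 for a root split into i and n - i tips."""
    return (
        expected_discrete_index(i)
        + expected_discrete_index(n - i)
        - expected_discrete_index(n)
        + n * (n - 1) / 2  # binomial(n, 2)
        - i * (n - i)
    ) / n**2


def c_tilde(x: float) -> float:
    """Limiting function 0.5 - 3x(1-x) on [0, 1]."""
    return 0.5 - 3.0 * x * (1.0 - x)


def sup_gap(n: int, points_per_cell: int = 20) -> float:
    """Grid approximation of sup_x |c_tilde_n(n, ceil(n x)) - c_tilde(x)|.

    For each i, x runs over the closed cell [(i-1)/n, i/n]; ceil(n x) = i on
    the half-open cell, and the left endpoint is included as a limit point.
    """
    worst = 0.0
    for i in range(1, n + 1):
        c_ni = c_tilde_n(n, i)
        for t in range(points_per_cell + 1):
            x = (i - 1 + t / points_per_cell) / n
            worst = max(worst, abs(c_ni - c_tilde(x)))
    return worst


def lemma_bound(n: int) -> float:
    """Explicit part of the bound, 2 ln(n)/n + 6/n (assumed constant, remainder dropped)."""
    return 2.0 * math.log(n) / n + 6.0 / n


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, sup_gap(n), lemma_bound(n))
```

Summing `c_tilde_n(n, i)` over i = 1, ..., n-1 should give zero up to rounding, which is what one expects if the root split is uniform on {1, ..., n-1} and the term is centred. This gives a quick consistency check on the assumed expectation formula.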

Journal

Mathematics, arXiv (Cornell University)

Published: Mar 27, 2017

References