Beta-Binomial stick-breaking non-parametric prior
Gil-Leyva, María F.;Mena, Ramsés H.;Nicoleris, Theodoros
2019-08-19 00:00:00
A new class of nonparametric prior distributions, termed Beta-Binomial stick-breaking process, is proposed. By allowing the underlying length random variables to be dependent through a Markov chain with Beta marginals, an appealing discrete random probability measure arises. The chain's dependence parameter controls the ordering of the stick-breaking weights, and thus tunes the model's label-switching ability. Also, by tuning this parameter, the resulting class contains the Dirichlet process and the Geometric process priors as particular cases, which is of interest for MCMC implementations. Some properties of the model are discussed and a density estimation algorithm is proposed and tested with simulated datasets.

Keywords: Beta-Binomial Markov chain, Density estimation, Dirichlet process prior, Geometric process prior, Stick-breaking prior

arXiv:1908.06602v2 [math.ST] 11 Aug 2020

1 Introduction

Discrete random probability measures and their distributions play a key role in Bayesian nonparametric statistics. The availability of general classes of priors and their different representations are crucial for the study of theoretical properties, as well as for the proposal of simulation and estimation algorithms. This continuously encourages the search for competitive alternatives to the canonical model, the Dirichlet process of Ferguson (1973). At the outset, one could consider a (proper) species sampling process (Pitman; 2006) over a measurable Polish space $(S, \mathcal{B}(S))$,
\[ \mu = \sum_{j \geq 1} w_j \delta_{\xi_j}, \tag{1} \]
where the atoms, $\Xi = (\xi_j)_{j \geq 1}$, and the weights, $W = (w_j)_{j \geq 1}$, are independent collections of random variables (r.v.'s), with $\xi_j \overset{iid}{\sim} P_0$, a diffuse measure on $(S, \mathcal{B}(S))$, and $\sum_{j \geq 1} w_j = 1$, almost surely (a.s.). To fully specify the law of $\mu$, one could assume a form for $P_0$ and place a distribution over the infinite dimensional simplex $\Delta_\infty = \{(w_1, w_2, \ldots) : w_i \geq 0, \sum_{i \geq 1} w_i = 1\}$. An important aspect to note is that
\[ \sum_{j \geq 1} w_{\tau(j)} \delta_{\xi_j} \overset{d}{=} \sum_{j \geq 1} w_j \delta_{\xi_j} \tag{2} \]
for every random permutation of $\mathbb{N}$, $\tau$, independent of $\Xi$.
This means that once the atoms' distribution, $P_0$, is fixed, there are infinitely many distributions over $\Delta_\infty$ that lead to the exact same prior, hence the need to study orderings for the weights. In particular, one can consider the decreasing ordering of its elements, here denoted by $W^{\downarrow} = (w_j^{\downarrow})_{j \geq 1}$, with $w_1^{\downarrow} > w_2^{\downarrow} > \cdots$ a.s., or the size-biased permutation, denoted by $\tilde{W} = (\tilde{w}_j)_{j \geq 1}$, which satisfies $P[\tilde{w}_1 = w_j \mid W] = w_j$, and for $n \geq 2$
\[ P[\tilde{w}_n = w_j \mid W, \tilde{w}_1, \ldots, \tilde{w}_{n-1}] = \frac{w_j}{1 - \sum_{i=1}^{n-1} \tilde{w}_i} \, 1_{\{w_j \notin \{\tilde{w}_1, \ldots, \tilde{w}_{n-1}\}\}}. \]
Working with decreasing representations of the weights reduces the identifiability problem that arises from (2) in the sense that if $\gamma_1, \gamma_2, \ldots$ is sampled i.i.d. from $\mu$, conditionally given $\mu$, then $w_1^{\downarrow}$ corresponds to the atom that appears most frequently in the sequence, $w_2^{\downarrow}$ corresponds to the second most frequent value, and so on (e.g., Mena and Walker; 2015). On the other hand, the size-biased permutation of the weights is of interest when the focus is on the clusters featured in the sample, i.e. if $\gamma_j^{*}$ is the $j$th distinct value to appear in the sample, then the long-run proportion of elements in $\{n : \gamma_n = \gamma_j^{*}\}$ coincides precisely with $\tilde{w}_j$ (Pitman; 1996a).

Different techniques to place distributions on $\Delta_\infty$ are available (e.g. Ferguson; 1973; Blackwell and MacQueen; 1973; James et al.; 2009) and connections among such techniques are well known (e.g. Ishwaran and James; 2001; Ishwaran and Zarepour; 2002; Hjort et al.; 2010). Perhaps one of the most practical constructions is enjoyed by the so-called stick-breaking process (McCloskey; 1965; Sethuraman; 1994; Ishwaran and James; 2001) where the weights are decomposed as
\[ w_1 = v_1, \qquad w_j = v_j \prod_{i=1}^{j-1} (1 - v_i), \quad j \geq 2, \tag{3} \]
for some sequence taking values in $[0, 1]$, $V = (v_i)_{i \geq 1}$, hereinafter referred to as length variables (l.v.'s). The practical compromise inherent to (3) is relatively small, as most practical classes of priors have a stick-breaking representation, e.g. the Dirichlet process (Ferguson; 1973; Sethuraman; 1994), its two-parameter generalization (Perman et al.; 1992; Pitman and Yor; 1992), the normalized inverse-Gaussian process (Favaro et al.; 2012) and the more general class of homogeneous normalized random measures with independent increments (Favaro et al.; 2016). In particular, the Dirichlet process is recovered when $v_i \overset{iid}{\sim} Be(1, \theta)$, for some $\theta > 0$, and, as shown by Pitman (1996b), the resulting weights coincide with their size-biased permutation, an ideal feature for clustering (Pitman; 1996a). A different stick-breaking prior is the Geometric process, introduced by Fuentes-García et al. (2010). For this case, the decreasing ordering of the weights takes the form
\[ w_j = \lambda (1 - \lambda)^{j-1}, \quad j \geq 1, \]
for some $\lambda \sim Be(\alpha, \theta)$, with $\alpha, \theta > 0$. Here the random variables $(v_i)_{i \geq 1}$ are completely dependent, indeed identical, unlike for the Dirichlet process. As mentioned above, the ordering of the weights, or lack of it, is of high relevance when using Bayesian nonparametric priors for density estimation and/or clustering.
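As an illustration of the decomposition (3), the following minimal Python sketch maps a truncated sequence of length variables to weights and contrasts the two special cases mentioned above. The function name `stick_breaking`, the truncation level `J`, the seed and the parameter values are choices made for this example only; they do not come from the paper.

```python
import numpy as np

def stick_breaking(v):
    """Map length variables v_1, v_2, ... in [0, 1] to weights via (3)."""
    v = np.asarray(v, dtype=float)
    # prod_{i<j} (1 - v_i), with the empty product equal to 1 for j = 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)    # arbitrary seed
J, alpha, theta = 200, 1.0, 2.0   # truncation level and parameters (illustrative)

# Dirichlet process: i.i.d. length variables v_i ~ Be(1, theta)
v_dp = rng.beta(1.0, theta, size=J)
w_dp = stick_breaking(v_dp)

# Geometric process: one lambda ~ Be(alpha, theta) repeated along the stick
lam = rng.beta(alpha, theta)
w_geo = stick_breaking(np.full(J, lam))

print("Dirichlet weights:", w_dp[:5], "truncated sum:", w_dp.sum())
print("Geometric weights:", w_geo[:5], "truncated sum:", w_geo.sum())
```

With identical length variables the weights reduce to $\lambda(1-\lambda)^{j-1}$ and are therefore decreasing, whereas the i.i.d. case carries no such guarantee.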
The dependence on only one random variable makes the Geometric process an attractive choice from a numerical point of view, and also makes it quite simple to generalize to non-exchangeable settings (Fuentes-García et al.; 2009; Mena et al.; 2011; Hatjispyros et al.; 2018). Furthermore, as shown by Bissiri and Ongaro (2014), both the Dirichlet and the Geometric processes have full support.

We propose a new class of stick-breaking distributions over $\Delta_\infty$, featuring dependent l.v.'s driven by a strictly stationary Beta Markov chain, thus leading to a novel family of random probability measures, the Beta-Binomial stick-breaking (BBSB) priors. The Beta Markov chain in question has a dependence parameter which modulates the ordering of the corresponding weights, allowing BBSB priors to enjoy a good trade-off between weights identifiability and mixing. For extreme values of the dependence parameter, we find that the Dirichlet process and the Geometric process priors are particular cases of our model. Furthermore, using an extension of the aforementioned result by Bissiri and Ongaro (2014), we will see that BBSB priors also have full support.

The remaining part of the article is organized as follows: In Section 2 we present the construction of the Markov chain with $Be(\alpha, \theta)$ marginals. Therein, we also analyse some special and limiting cases that will subsequently allow us to recover the Dirichlet and Geometric processes. This Markov chain is then used in Section 3 to assemble a sequence of l.v.'s, thus leading to Beta-Binomial stick-breaking priors. In Section 4 we derive a sampling scheme for density estimation and in Section 5 we test it on simulated data. The proofs of the main results are deferred to the Appendix.

2 Beta-Binomial Markov chain

Following Pitt et al.
(2002), given a density function $\pi_{v,x}(v, x)$ with marginals $\pi_v(v)$ and $\pi_x(x)$, and whose conditional distributions are $\pi_{v|x}(v \mid x)$ and $\pi_{x|v}(x \mid v)$, it is possible to construct two reversible Markov chains $(v_i)_{i \geq 1}$ and $(x_i)_{i \geq 1}$ with stationary distributions $\pi_v$ and $\pi_x$ respectively. The construction considers the law induced by $v_1 \sim \pi_v$, and $\{x_i \mid v_i\} \sim \pi_{x|v}(\cdot \mid v_i)$, $\{v_{i+1} \mid x_i\} \sim \pi_{v|x}(\cdot \mid x_i)$, for $i \geq 1$, where $v_{i+1}$ is conditionally independent of $(v_1, x_1, \ldots, v_{i-1}, x_{i-1}, v_i)$ given $x_i$, and analogously $x_{i+1}$ is conditionally independent of $(v_1, x_1, \ldots, v_i, x_i)$ given $v_{i+1}$. Arising from the Beta-Binomial conjugate model, we take
\[ \pi_{v,x}(v, x) = Bin(x \mid \kappa, v) \, Be(v \mid \alpha, \theta), \]
for some $\alpha, \theta > 0$, $\kappa \in \{0, 1, \ldots\}$, and where $Bin(0, p) = \delta_0$. Thus, the dependence induced by $v_1 \sim Be(\alpha, \theta)$, and $\{x_i \mid v_i\} \sim Bin(\kappa, v_i)$, $\{v_{i+1} \mid x_i\} \sim Be(\alpha + x_i, \theta + \kappa - x_i)$, for $i \geq 1$, generates Markov chains, $V = (v_i)_{i \geq 1}$ and $X = (x_i)_{i \geq 1}$, where the former has transition probabilities given by
\[ P[v_i \in A \mid v_{i-1}] = \sum_{x=0}^{\kappa} \int_A Be(s \mid \alpha + x, \theta + \kappa - x) \, Bin(x \mid \kappa, v_{i-1}) \, ds, \tag{4} \]
and stationary distribution $Be(\alpha, \theta)$, and the latter
\[ P[x_i = x \mid x_{i-1}] = \int_0^1 Bin(x \mid \kappa, p) \, Be(p \mid \alpha + x_{i-1}, \theta + \kappa - x_{i-1}) \, dp = \binom{\kappa}{x} \frac{(\alpha + x_{i-1})_{x\uparrow} \, (\theta + \kappa - x_{i-1})_{\kappa - x\uparrow}}{(\alpha + \theta + \kappa)_{\kappa\uparrow}}, \tag{5} \]
where $(y)_{m\uparrow} = \prod_{j=0}^{m-1} (y + j)$, and its stationary distribution is
\[ P[x_i = x] = \binom{\kappa}{x} \frac{(\alpha)_{x\uparrow} \, (\theta)_{\kappa - x\uparrow}}{(\alpha + \theta)_{\kappa\uparrow}}. \tag{6} \]
We refer to any such Markov chains $V$, $X$ and $(V, X) = (v_i, x_i)_{i \geq 1}$ as Beta, Binomial and Beta-Binomial chains, respectively. See Nieto-Barajas and Walker (2002) and Mena and Walker (2009) for more on this kind of Markov chain. In what follows, we focus on the Beta chain and some of its properties, specifically on how the parameter $\kappa$ affects the dependence of the chain. This will be relevant for our construction of the nonparametric prior in the following section.

Proposition 2.1.
Let $(V, X)$ be a Beta-Binomial chain with parameters $(\alpha, \theta, \kappa)$, then for the Beta chain, $V$, and for every $i \geq 1$, we have the following moments:

a) $E[v_{i+1} \mid v_i] = \dfrac{\alpha + \kappa v_i}{\alpha + \theta + \kappa}$.

b) $Var(v_{i+1} \mid v_i) = \dfrac{(\alpha + \kappa v_i)(\theta + \kappa(1 - v_i)) + \kappa v_i (1 - v_i)(\alpha + \theta + \kappa)}{(\alpha + \theta + \kappa)^2 (\alpha + \theta + \kappa + 1)}$.

c) $Cov(v_i, v_{i+1}) = \dfrac{\kappa \alpha \theta}{(\alpha + \theta)^2 (\alpha + \theta + 1)(\alpha + \theta + \kappa)}$.

d) $\rho_{v_i, v_{i+1}} = \dfrac{Cov(v_i, v_{i+1})}{\sqrt{Var(v_i)} \sqrt{Var(v_{i+1})}} = \dfrac{\kappa}{\alpha + \theta + \kappa}$.

Fixing the value of $\kappa$ and increasing either $\alpha$ or $\theta$, the correlation coefficient $\rho_{v_i, v_{i+1}}$ goes to 0. Conversely, if we fix $\alpha$ and $\theta$, for large values of $\kappa$, $\rho_{v_i, v_{i+1}} \approx 1$. Also, if $\alpha$ and $\theta$ are very small with respect to $\kappa$,
\[ E[v_{i+1} \mid v_i] \approx v_i \quad \text{and} \quad Var(v_{i+1} \mid v_i) \approx \frac{2 v_i (1 - v_i)}{\kappa + 1}. \]
Hence, intuition tells us that the conditional distribution of $v_{i+1}$ given $v_i$ tends to $\delta_{v_i}$ as $\kappa$ grows, see Figure 1. The following result generalizes this intuition.

Figure 1: Conditional densities of $v_{i+1}$ given $v_i = 0.4$, for distinct values of $\kappa$. We vary $\kappa$ in the set $\{10, 50, 200, 1000, 5000\}$; the values of $\alpha$ and $\theta$ were fixed to 10.

Proposition 2.2. Let $V^{(\kappa)} = (v_i^{(\kappa)})_{i \geq 1}$ be a Beta chain with parameters $(\alpha, \theta, \kappa)$.

(i) For $\kappa = 0$, $V^{(0)}$ is a sequence of i.i.d. random variables with distribution $Be(\alpha, \theta)$.

(ii) As $\kappa \to \infty$, $V^{(\kappa)}$ converges in distribution to $(\lambda, \lambda, \ldots)$, where $\lambda \sim Be(\alpha, \theta)$.

3 Beta-Binomial stick-breaking prior

We call any species sampling process, $\mu$, a Beta-Binomial stick-breaking prior if its weights sequence is as in (3) for some l.v.'s, $V$, driven by a Beta chain with transition density (4). As usual, the parameters of the l.v.'s are inherited by the prior, to which the diffuse probability measure, $P_0$, is added as an additional parameter. The first property to check is that the corresponding weights add up to one.

Proposition 3.1. Let $W$ be as in equation (3), for some Beta chain, $V$.
Then $\sum_{j \geq 1} w_j \overset{a.s.}{=} 1$.

Moreover, notice that for every $0 < \delta < \varepsilon < 1$ and $n \geq 1$, any Beta-Binomial chain, $(V, X)$, with parameters $(\alpha, \theta, \kappa)$, satisfies
\[ P\left[\bigcap_{i=1}^{n} (\delta < v_i < \varepsilon)\right] = E\left[\prod_{i=1}^{n} P[\delta < v_i < \varepsilon \mid X]\right] = E\left[P[\delta < v_1 < \varepsilon \mid x_1] \prod_{i=2}^{n} P[\delta < v_i < \varepsilon \mid x_{i-1}, x_i]\right] > 0, \]
as conditionally given $X$, the elements of $V$ are independent and Beta distributed. By the results of Bissiri and Ongaro (2014), the above observation implies that any BBSB prior has full support, and is thus feasible for nonparametric inference. The following results, which follow from Proposition 2.2, motivate their study.

Theorem 3.2. Let $\mu^{(\kappa)}$ be a BBSB prior with parameters $(\alpha, \theta, \kappa, P_0)$, then

(i) For $\kappa = 0$ and $\alpha = 1$, $\mu^{(0)}$ is a Dirichlet process with parameters $(\theta, P_0)$.

(ii) For any $\alpha$ and $\theta$ fixed, as $\kappa \to \infty$, $\mu^{(\kappa)}$ converges in distribution to the Geometric process with parameters $(\alpha, \theta, P_0)$.

In terms of the ordering of the corresponding weights, we have the following corollary.

Corollary 3.3. Let $(w_j^{(\kappa)})_{j \geq 1}$ be as in equation (3), for some Beta chain, $(v_i^{(\kappa)})_{i \geq 1}$, with parameters $(\alpha, \theta, \kappa)$. Then

(i) For $\alpha = 1$, $\kappa = 0$, and any choice of $\theta$, $(w_j^{(\kappa)})_{j \geq 1}$ is size-biased ordered.

(ii) For any choices of $\alpha$ and $\theta$, and for every $j \geq 1$,
\[ \lim_{\kappa \to \infty} P\left[w_{j+1}^{(\kappa)} < w_j^{(\kappa)}\right] = 1. \]

Figure 2: Simulations of $(w_j)_{j \geq 1}$ (A.2 and B.2) and their corresponding l.v.'s (A.1 and B.1 respectively) for distinct values of $\kappa$. For the Beta chains in A.1, we fixed $\alpha = 1$ and $\theta = 1$; for the ones in B.1 we used the same value of $\alpha$, whilst $\theta = 10$. The chains in a single graph share the same initial r.v. for the sake of a simpler analysis.

If we fix $\alpha = 1$, the choice $\kappa = 0$ implies that $W = \tilde{W}$ is size-biased ordered. In general for such sequences $E[\tilde{w}_j] \geq E[\tilde{w}_{j+1}]$, even though $\tilde{w}_j \geq \tilde{w}_{j+1}$ does not occur with probability 1. On the other extreme, as $\kappa \to \infty$ we have the decreasing ordering of the Geometric weights $W = W^{\downarrow}$, which satisfy $P[w_j^{\downarrow} \geq w_{j+1}^{\downarrow}] = 1$. Roughly speaking, by increasing the parameter $\kappa$, we make the weights sequence more likely to be decreasingly ordered.
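The construction of Sections 2 and 3 can be sketched in a few lines of Python. The sketch below simulates the Beta chain through the latent Binomial variables, compares the empirical lag-one correlation with $\kappa/(\alpha + \theta + \kappa)$ from Proposition 2.1 (d), and estimates $P[w_2 < w_1]$ by Monte Carlo for a small and a large value of $\kappa$. The function names, seed, chain lengths and parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def beta_chain(alpha, theta, kappa, size, rng):
    """Simulate v_1, ..., v_size from the Beta chain with Be(alpha, theta) marginals."""
    v = np.empty(size)
    v[0] = rng.beta(alpha, theta)
    for i in range(size - 1):
        x = rng.binomial(kappa, v[i])                      # x_i | v_i ~ Bin(kappa, v_i)
        v[i + 1] = rng.beta(alpha + x, theta + kappa - x)  # v_{i+1} | x_i
    return v

def stick_breaking(v):
    """Weights w_j = v_j * prod_{i<j} (1 - v_i), as in (3)."""
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def prob_w2_less_w1(alpha, theta, kappa, reps, rng):
    """Monte Carlo estimate of P[w_2 < w_1] for a BBSB weights sequence."""
    hits = 0
    for _ in range(reps):
        v = beta_chain(alpha, theta, kappa, 2, rng)
        hits += int(v[1] * (1.0 - v[0]) < v[0])
    return hits / reps

rng = np.random.default_rng(1)  # arbitrary seed
alpha, theta, kappa = 1.0, 1.3, 12

# Lag-one correlation of a long chain versus Proposition 2.1 (d)
v = beta_chain(alpha, theta, kappa, 100_000, rng)
rho_hat = np.corrcoef(v[:-1], v[1:])[0, 1]
rho = kappa / (alpha + theta + kappa)
print("empirical correlation:", rho_hat, "theoretical:", rho)

# A truncated BBSB weights sequence
w = stick_breaking(beta_chain(alpha, theta, kappa, 50, rng))
print("first weights:", w[:5])

# Ordering of the first two weights for independent and strongly dependent l.v.'s
p_lo = prob_w2_less_w1(1.0, 1.0, 0, 2000, rng)
p_hi = prob_w2_less_w1(1.0, 1.0, 1000, 2000, rng)
print("P[w2 < w1] with kappa = 0:", p_lo, "and kappa = 1000:", p_hi)
```

Sampling through the latent $x_i$ avoids evaluating the mixture transition density (4) directly, which is why the loop only needs one Binomial and one Beta draw per step.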
Figure 2 shows some simulations of $(w_j)_{j \geq 1}$ and their corresponding l.v.'s that illustrate the aforementioned behaviour. The initial value, $v_1$, of the Beta chain strongly affects the behaviour of the complete sequence of weights; this is particularly evident for large values of $\kappa$. Recall that if $\kappa$ is sufficiently large we have $v_1 \approx v_2$, so for instance if $v_1$ is close to 0, then $(1 - v_1) \approx 1$ and $w_2 = v_2 (1 - v_1) \approx v_1 = w_1$, which means that if $v_2 > v_1$ even slightly, we might obtain $w_2 > w_1$. Alternatively, a large value of $v_1$ translates to a small value of $(1 - v_1)$, so in order to obtain $w_2 > w_1$, it would require $v_2$ to be significantly larger than $v_1$, which under the assumption that $\kappa$ is large is not very likely to happen, as $v_1 \approx v_2$. The same intuition carries over to the subsequent indexes since we also have $v_2 \approx v_3$, for large values of $\kappa$. Hence, the larger/smaller $v_1$ is, the larger/smaller we expect $v_i$ to be, for $i > 1$. Moreover, for large values of the parameter $\theta$ we expect $v_1$ to take small values, thus in general, a bigger value of $\theta$ requires an even larger value of $\kappa$ to induce a stochastically decreasing ordering of the weights.

3.1 Distribution of the number of groups

When working with any species sampling process, $\mu$, such as a Dirichlet, BBSB or Geometric process, a r.v. of interest is the number of distinct values, $K_n$, that a sample $\{\gamma_1, \ldots, \gamma_n\}$ driven by $\mu$ exhibits. Although for some priors it is possible to compute or characterize the probabilistic behaviour of $K_n$ (see for instance Pitman; 2006), in general this is not an easy task. Despite this, whenever it is feasible to obtain samples from the weights sequence, $W$, as is the case for any BBSB prior, obtaining samples from $K_n$ can be easily achieved as follows: Sample $n$ independent $U(0, 1)$ r.v.'s, $(u_k)_{k=1}^{n}$, and $(w_j)_{j=1}^{\varphi}$, where $\varphi$ is some constant satisfying $\sum_{j=1}^{\varphi} w_j > \max_k u_k$. For $k \in \{1, \ldots, n\}$ and $i \in \{1, \ldots, \varphi\}$, let $d_k = i$ if and only if $\sum_{j=1}^{i-1} w_j < u_k < \sum_{j=1}^{i} w_j$ (with the convention that the empty sum equals 0); then the number of distinct values that $(d_1, \ldots, d_n)$ exhibits is precisely a sample from $K_n$.

To understand how the parameters of a BBSB prior affect the distribution of $K_n$, we sampled as aforementioned varying the values of $\alpha$, $\theta$ and $\kappa$. Particularly, Figure 3(A) shows the distribution of $K_n$ corresponding to the Dirichlet process, for which it is well known that $E[K_n]$ increases when $\theta$ grows. This location behaviour is also observed for other fixed values of $\kappa$ (B, C and D). Figures 3 and 4 illustrate how, for fixed $\alpha$ and $\theta$, an increment of $\kappa$ gives the distribution of $K_n$ a heavier right tail, and thus a larger mean and variance; that is, the prior on $K_n$ is less informative. In Figure 3, where we fixed $\alpha = 1$, it can be observed that for bigger values of $\theta$, the distribution of $K_n$ is more sensitive to an increment of $\kappa$. The same can be seen in Figure 4, for fixed $\theta = 1$ and smaller values of $\alpha$.

Figure 3: Frequency polygons of samples of size 10000 from $K_n$ for distinct values of $\kappa$ and $\theta$ and fixing $\alpha = 1$. For the frequency polygons in A, B and C we fixed $\kappa$ to 0, 10 and 100 respectively, whilst the frequency polygons in D correspond to the Geometric prior. For each fixed value of $\kappa$, we vary $\theta$ in the set $\{0.5, 1, 3, 6, 10\}$.

Figure 4: Frequency polygons of samples of size 10000 from $K_n$ for distinct values of $\kappa$ and $\alpha$ and fixing $\theta = 1$.
For the frequency polygons in A, B and C we fixed $\kappa$ to 0, 10 and 100 respectively, whilst the frequency polygons in D correspond to the Geometric prior. For each fixed value of $\kappa$, we vary $\alpha$ in the set $\{0.5, 0.75, 1, 3, 6\}$.

4 Density estimation for Beta-Binomial mixtures

Given a BBSB prior, $\mu$, and a density kernel, $g(\cdot \mid s)$, with parameter space $S$, we can consider BBSB mixtures. Namely, we can model elements in $y^{(n)} = \{y_1, \ldots, y_n\}$ as i.i.d. sampled from the random density
\[ \phi(y) := \phi(y \mid W, \Xi) = \int g(y \mid s) \, \mu(ds) = \sum_{j \geq 1} w_j \, g(y \mid \xi_j). \tag{7} \]
For MCMC implementation purposes, and following Walker (2007), this random density can be augmented as
\[ \phi(y, u \mid W, \Xi) = \sum_{j \geq 1} 1_{\{u < w_j\}} \, g(y \mid \xi_j), \tag{8} \]
from which it can be easily deduced that
\[ \phi(u \mid W) = \sum_{j \geq 1} 1_{\{u < w_j\}}. \tag{9} \]
As in the Dirichlet process case, given $u$, the number of components in the mixture is finite, with indexes being the elements of $A_u(W) = \{j : u < w_j\}$, that is
\[ \phi(y \mid u, W, \Xi) = \frac{1}{|A_u(W)|} \sum_{j \in A_u(W)} g(y \mid \xi_j). \tag{10} \]
Using the membership variable $d$, i.e. $d = j$ if and only if $y$ is sampled from $g(\cdot \mid \xi_j)$, one can further consider the augmented joint density
\[ \phi(y, u, d \mid W, \Xi) = 1_{\{u < w_d\}} \, g(y \mid \xi_d). \tag{11} \]
The complete data likelihood based on a sample of size $n$ from (11) is easily seen to be
\[ \mathcal{L}_{\Xi, W}\big((y_k, u_k, d_k)_{k=1}^{n}\big) = \prod_{k=1}^{n} 1_{\{u_k < w_{d_k}\}} \, g(y_k \mid \xi_{d_k}), \tag{12} \]
and under the assumption that $P_0$ has a density, $p_0$, with respect to a suitable measure, the full joint density of every variable involved is
\[ \pi\big((y_k, u_k, d_k)_{k=1}^{n}, (v_i)_{i \geq 1}, (\xi_j)_{j \geq 1}\big) = \prod_{k=1}^{n} 1_{\{u_k < w_{d_k}\}} \, g(y_k \mid \xi_{d_k}) \prod_{j \geq 1} p_0(\xi_j) \times Be(v_1 \mid \alpha, \theta) \prod_{i \geq 1} \left( \sum_{x=0}^{\kappa} Be(v_{i+1} \mid \alpha + x, \theta + \kappa - x) \, Bin(x \mid \kappa, v_i) \right), \tag{13} \]
where we recall that $w_{d_k} = v_{d_k} \prod_{i=1}^{d_k - 1} (1 - v_i)$, with the convention that the empty product equals 1.

4.1 Full conditionals

The full conditional distributions, required for posterior inference via a Gibbs sampler implementation, are proportional to (13), and are given as follows.

1. Updating $\Xi$:
\[ \pi(\xi_j \mid \ldots) \propto p_0(\xi_j) \prod_{k \in D_j} g(y_k \mid \xi_j), \quad j \geq 1, \]
where $D_j = \{k \geq 1 : d_k = j\}$. If $p_0$ and $g$ form a conjugate pair, the above is easy to sample from.

2.
Updating $V$ and $U = (u_k)_{k=1}^{n}$ as a block:
\[ \pi(U, V \mid \ldots) \propto \prod_{k=1}^{n} \left[ w_{d_k} \, \frac{1}{w_{d_k}} 1_{\{u_k < w_{d_k}\}} \right] Be(v_1 \mid \alpha, \theta) \prod_{i \geq 1} \left( \sum_{x=0}^{\kappa} Be(v_{i+1} \mid \alpha + x, \theta + \kappa - x) \, Bin(x \mid \kappa, v_i) \right). \]
As $w_{d_k} = v_{d_k} \prod_{i=1}^{d_k - 1} (1 - v_i)$, with the convention $\prod_{i=1}^{0} (\cdot) = 1$, then
\[ \pi(U, V \mid \ldots) \propto \left[ \prod_{k=1}^{n} \frac{1}{w_{d_k}} 1_{\{u_k < w_{d_k}\}} \right] \left[ v_1^{a_1} (1 - v_1)^{b_1} Be(v_1 \mid \alpha, \theta) \right] \prod_{i \geq 1} \left[ \sum_{x=0}^{\kappa} v_{i+1}^{a_{i+1}} (1 - v_{i+1})^{b_{i+1}} Be(v_{i+1} \mid \alpha + x, \theta + \kappa - x) \, Bin(x \mid \kappa, v_i) \right], \]
where
\[ a_i = \sum_{k=1}^{n} 1_{\{d_k = i\}} \quad \text{and} \quad b_i = \sum_{k=1}^{n} 1_{\{d_k > i\}}. \]
Recalling that for $m \in \mathbb{N}$ and $z > 0$, $\Gamma(m + z) = (z)_{m\uparrow} \Gamma(z)$, we obtain
\[ \pi(U, V \mid \ldots) \propto \left[ \prod_{k=1}^{n} U(u_k \mid 0, w_{d_k}) \right] \left[ Be(v_1 \mid \alpha + a_1, \theta + b_1) \right] \prod_{i \geq 1} \sum_{x=0}^{\kappa} Be(v_{i+1} \mid \alpha + a_{i+1} + x, \theta + b_{i+1} + \kappa - x) \, \frac{(\alpha + x)_{a_{i+1}\uparrow} \, (\theta + \kappa - x)_{b_{i+1}\uparrow}}{(\alpha + \theta + \kappa)_{(a_{i+1} + b_{i+1})\uparrow}} \, Bin(x \mid \kappa, v_i), \]
with the convention $(z)_{0\uparrow} = 1$. Thus, to update $V$ and $U$, we first sample $V$ from
\[ \pi(V \mid \ldots (\text{exclude } U) \ldots) \propto \left[ Be(v_1 \mid \alpha + a_1, \theta + b_1) \right] \prod_{i \geq 1} \sum_{x=0}^{\kappa} Be(v_{i+1} \mid \alpha + a_{i+1} + x, \theta + b_{i+1} + \kappa - x) \, \frac{(\alpha + x)_{a_{i+1}\uparrow} \, (\theta + \kappa - x)_{b_{i+1}\uparrow}}{(\alpha + \theta + \kappa)_{(a_{i+1} + b_{i+1})\uparrow}} \, Bin(x \mid \kappa, v_i), \]
which can be normalized to a product of mixtures of Beta densities, and later sample $U$ from $\pi(U \mid \ldots) = \prod_{k=1}^{n} U(u_k \mid 0, w_{d_k})$.

3. Updating $D = (d_k)_{k=1}^{n}$:
\[ \pi(d_k = j \mid \ldots) \propto g(y_k \mid \xi_j) \, 1_{\{u_k < w_j\}}, \quad k \in \{1, \ldots, n\}, \]
which is a discrete distribution with finite support, hence easy to sample from.

Remark 4.1 (For the updating of $\Xi$ and $V$). As is well known for this algorithm, we do not need to sample $v_j$ and $\xi_j$ for every $j \geq 1$; it suffices to sample enough of them so that step 3 can take place. Explicitly, it suffices to sample $\xi_j$ and $v_j$ for $j \leq \varphi$, where $\varphi$ is a constant such that $\sum_{j=1}^{\varphi} w_j \geq \max_k (1 - u_k)$, since then it is not possible that $w_j > u_k$ for any $k \leq n$ and $j > \varphi$.

4.2 Posterior distribution analysis

Given samples $\big\{(\xi_j^{(t)}, w_j^{(t)})_j, (u_k^{(t)}, d_k^{(t)})_k\big\}_{t=1}^{T}$ from $\{\Xi, W, U, D \mid y^{(n)}\}$ obtained after $T$ iterations of the Gibbs sampler, following (10) we estimate the density of the data by
\[ E\big[\phi(y) \mid y^{(n)}\big] \approx \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{k=1}^{n} \frac{1}{|A_k^{(t)}|} \sum_{j \in A_k^{(t)}} g\big(y \mid \xi_j^{(t)}\big), \tag{14} \]
where $A_k^{(t)} = \{j : u_k^{(t)} < w_j^{(t)}\}$. Furthermore, we can also estimate the posterior distribution of $\{K_n \mid y^{(n)}\}$ through
\[ P\big[K_n = m \mid y^{(n)}\big] \approx \frac{1}{T} \sum_{t=1}^{T} 1_{\{K_n^{(t)} = m\}}, \tag{15} \]
where $K_n^{(t)}$ is the number of distinct values that $(d_k^{(t)})_{k=1}^{n}$ exhibits. As usual, when working with mixtures of densities, $K_n$ can be interpreted as the number of components of the mixture featured in the sample $y^{(n)}$, that is, the number of elements in $\{g(\cdot \mid \xi_j)\}_{j \geq 1}$ such that $y_k$ is sampled from $g(\cdot \mid \xi_j)$, for some $y_k \in y^{(n)}$. This way, the estimates (14) together with (15) give us information on how well a model performs for the given data set. Among the models for which (14) adjusts well to the data, those for which (15) favours smaller values of $m$ might be preferred, as this means the model is mixing the components, $\{g(\cdot \mid \xi_j)\}_{j \geq 1}$, more efficiently.

4.3 Posterior inference for the dependence parameter

In order to highlight the role of the dependence parameter, $\kappa$, we incorporate its posterior inference. Namely, we consider this parameter random and endow it with a prior distribution, $\pi_\kappa$. For this case, the likelihood (12) remains identical and the joint distribution (13) is multiplied by $\pi_\kappa(\kappa)$. It can easily be seen that, conditionally given $\kappa$, the full conditionals $\{\pi(\xi_j \mid \ldots)\}_{j \geq 1}$, $\{\pi(d_k \mid \ldots)\}_{k=1}^{n}$ and $\pi(V, U \mid \ldots)$ also remain the same. As to the full conditional of $\kappa$ given the rest of the r.v.'s, we have that
\[ \pi(\kappa = r \mid \ldots) \propto \pi_\kappa(r) \prod_{i \geq 1} \sum_{x=0}^{r} Be(v_{i+1} \mid \alpha + x, \theta + r - x) \, Bin(x \mid r, v_i), \tag{16} \]
which is easy to sample from if $\pi_\kappa$ has finite support. Summarizing, at each iteration of the Gibbs sampler, we update $\Xi$, $V$, $U$ and $D$ as above and add a fourth step in which we sample $\kappa$ from (16). Finally, given the samples $(\kappa^{(t)})_{t=1}^{T}$ obtained after $T$ iterations of the Gibbs sampler, once the burn-in period has elapsed, we estimate the posterior distribution
However, depending on the sample, initial conditions, and current parameter values in the Gibbs sampler, the need to more/less ordered weights, thus dierent values of , might be required. To test the performance of BBSB priors for density estimation, we rst conduct a small experiment in which we x the value of to 0; 10; 100 and 1 and compare the results provided by the 4 distinct models. Secondly, in order to choose the optimal value of for a dataset and given that the rest of the parameters are xed, we place a prior distribution on the dependence parameter and analyse its posterior distribution. Here we also compare our models to another well-known stick-breaking prior, the Pitman-Yor process (Perman et al.; 1992; Pitman and Yor; 1992). In all cases we assume a Gaussian kernel with random location and scale parameters, i.e., for each j 1, = (m ; p ), and g(yj ) = N(yjm ; p ). To attain a conjugate j j j j j pair for p and g, we assume p ( ) = N(m j#; p )Ga(p ja; b), where a = b = 0:5, 0 0 j j j = 100 and # = n y : k=1 5.1 Analysis for BBSB mixtures with xed dependence parameter For this exercise we simulated a data set (database 1) containing 200 observations and featuring 11 modes equally spaced. As it is well known for this type of data, and if the parameter is not carefully chosen, the Dirichlet mixture under estimates the number modes featured in the sample. Alternatively, Geometric mixtures do recognize every mode, but they tend to use a large number of mixture components. In order to study how BBSB priors perform in this context, and to compare them with the Dirichlet and Geometric processes, we xed = 1, = 1 and vary in the set f0; 10; 100;1g. No burn-in period was considered, so that one may analyse the number of iterations required by the model to provide a good estimate. 
In Figure 5 we observe that the Dirichlet process (A) fails to recover the eleven modes featured in the dataset, whereas the three remaining models are able to capture the 11 well-separated modes. In terms of the number of iterations required to recognize the modes, we observe that BBSB mixtures with larger values of $\kappa$ (C and D) perform better. Consistently with the prior analysis of the number of groups, in Figure 6 we observe that the posterior mean and variance of $K_n$ increase as $\kappa$ does. Comparing Figures 5 and 6 we note that the model with $\kappa = 10$ (B) mixes the components of the mixture better than the other ones, in the sense that fewer components were needed in order to capture every mode. Overall, the cases $\kappa = 10$ (B) and $\kappa = 100$ (C) seem to inherit desirable properties from the limiting cases, i.e. $\kappa = 0$ (A) and $\kappa = \infty$ (D). From the Dirichlet process they inherit a more efficient component mixing, while from the Geometric process they inherit the flexibility to adapt even if the parameter $\theta$ is not carefully chosen.

Figure 5: Evolution of the estimated densities for database 1, through the first 3000 iterations of the Gibbs sampler, for four distinct BBSB mixtures. The estimated densities in A, B, C and D correspond to BBSB mixtures with $\kappa$ fixed to 0, 10, 100 and $\infty$ respectively; in the four cases $\alpha = \theta = 1$.

Figure 6: Frequency polygon of the estimated posterior distributions of $K_n$ given database 1 for the four BBSB mixtures which share the parameters $\alpha = \theta = 1$ and differ in the parameter $\kappa$, which varies in the set $\{0, 10, 100, \infty\}$.

5.2 Analysis for BBSB mixtures with random dependence parameter

The main objective of this analysis is to determine the optimal value of $\kappa$ for different datasets. To this aim, we first consider a very simple data set (database 2) consisting of 200 observations that were sampled from a mixture of two Gaussian distributions. Secondly, we examine a more complicated set of data (database 3) that contains 200 observations sampled from a mixture of seven Gaussian kernels with distinct means, variances and weights; this database was created and studied before by Lijoi et al. (2007). For each of database 2 and database 3, we study three BBSB mixtures with parameters $\alpha$ and $\theta$ fixed to distinct values, and compare the estimates with the ones provided by a Pitman-Yor mixture. Recall that this two-parameter generalization of the Dirichlet process has a stick-breaking representation with independent l.v.'s $v_i \sim Be(1 - \sigma, \theta + i\sigma)$, where $0 \leq \sigma < 1$ and $\theta > -\sigma$ (see for instance Perman et al.; 1992; Pitman and Yor; 1992; Pitman; 2006, for further details). In particular the Dirichlet process is recovered when $\sigma = 0$. For this mixture we fixed $\theta$ and considered the other parameter, $\sigma$, random with a uniform distribution over $[0, 1]$; this way the model is allowed to choose the best value of $\sigma$ for the data set.
In a similar spirit, for every BBSB mixture considered here, the parameter $\kappa$ was considered random with a uniform prior distribution over $\{0, 1, \ldots, 100\}$.

5.2.1 Results for database 2

In Figure 7 we observe that the estimated densities for the four mixtures adjust well to the data and do not differ significantly. In Figure 8 we see that every posterior distribution is asymmetrical, hence we will estimate the corresponding randomized parameter by the mode rather than the mean. For the BBSB models with parameter $\alpha = 1$ (A and B), the posterior mode of $\kappa$ equals 0, suggesting that for this simple data set, the Dirichlet process is an excellent choice. In D we see that for the Pitman-Yor mixture the posterior distribution of $\sigma$ also assigns a bigger probability to values closer to 0, so it agrees with our models that the Dirichlet process adjusts well to this data set. As for the BBSB model with $\alpha = 0.3$ and $\theta = 2$, we observe that the posterior distribution of $\kappa$ (C) prefers a value bigger than 0. Explicitly, the posterior mode of this distribution is $\kappa = 6$. This could be due to the fact that for $\alpha = 0.3$ and $\theta = 2$ the stick-breaking mixture with completely independent l.v.'s is not a good choice for this dataset, so the BBSB mixture corrects this by adjusting the value of the dependence parameter.

Figure 7: Estimated densities for database 2, taking into account 5000 iterations of the Gibbs sampler after a burn-in period of 3000, for three distinct BBSB mixtures with parameters $(\alpha, \theta)$ fixed to $(1, 1)$, $(1, 0.3)$ and $(0.3, 2)$, and a Pitman-Yor mixture with parameter $\theta = 1$.

Figure 8: Posterior distributions of $\kappa$ (A, B and C) for the BBSB mixtures with parameters $(\alpha, \theta)$ fixed to $(1, 1)$, $(1, 0.3)$ and $(0.3, 2)$, respectively. D illustrates the posterior distribution of $\sigma$ for the Pitman-Yor mixture with $\theta = 1$. The dotted and dashed lines indicate the posterior means and modes, respectively.
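The posterior inference for $\kappa$ used in this subsection relies on the full conditional (16) with a prior of finite support. A minimal Python sketch of that step is given below, under the following assumptions that are not stated in the paper: only the finitely many l.v.'s currently held by the sampler enter the product, every $v_i$ lies strictly inside $(0, 1)$, and the computation is done on the log scale for numerical stability. The function names, the seed and the reduced support $\{0, \ldots, 40\}$ are hypothetical choices for illustration.

```python
import math
import numpy as np

_gammaln = np.vectorize(math.lgamma, otypes=[float])

def beta_chain(alpha, theta, kappa, size, rng):
    """Simulate a Beta chain path (used here only to create test input)."""
    v = np.empty(size)
    v[0] = rng.beta(alpha, theta)
    for i in range(size - 1):
        x = rng.binomial(kappa, v[i])
        v[i + 1] = rng.beta(alpha + x, theta + kappa - x)
    return v

def log_transition(v, alpha, theta, r):
    """log sum_x Be(v_{i+1} | alpha + x, theta + r - x) Bin(x | r, v_i), for each i."""
    x = np.arange(r + 1, dtype=float)
    a, b = alpha + x, theta + r - x
    const = (_gammaln(r + 1.0) - _gammaln(x + 1.0) - _gammaln(r - x + 1.0)  # log C(r, x)
             + _gammaln(a + b) - _gammaln(a) - _gammaln(b))              # -log B(a, b)
    v0, v1 = v[:-1, None], v[1:, None]
    terms = (const + x * np.log(v0) + (r - x) * np.log1p(-v0)
             + (a - 1.0) * np.log(v1) + (b - 1.0) * np.log1p(-v1))
    top = terms.max(axis=1, keepdims=True)          # log-sum-exp over x
    return (top + np.log(np.exp(terms - top).sum(axis=1, keepdims=True))).ravel()

def kappa_full_conditional(v, alpha, theta, support, log_prior):
    """Normalized version of (16) over a finite support for kappa."""
    logp = np.array([lp + log_transition(v, alpha, theta, int(r)).sum()
                     for r, lp in zip(support, log_prior)])
    p = np.exp(logp - logp.max())
    return p / p.sum()

rng = np.random.default_rng(2)                        # arbitrary seed
alpha, theta = 1.0, 1.3
v = beta_chain(alpha, theta, 12, 400, rng)            # path generated with kappa = 12

support = np.arange(0, 41)                            # reduced support, illustrative
log_prior = np.full(support.size, -math.log(support.size))  # uniform prior
probs = kappa_full_conditional(v, alpha, theta, support, log_prior)

kappa_draw = rng.choice(support, p=probs)             # one Gibbs update of kappa
print("conditional mode:", support[np.argmax(probs)], "draw:", kappa_draw)
```

Precomputing the terms that depend only on $x$ and the candidate value keeps the cost of each evaluation linear in the number of l.v.'s times the candidate value, which is what makes a support such as $\{0, 1, \ldots, 100\}$ practical.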
5.2.2 Results for database 3

Since the distributions in Figure 10 are asymmetrical, once again we estimate the randomized parameter by the posterior modes. In the same figure we observe that the posterior distribution of $\kappa$ for every BBSB mixture (A, B and C) favours values of $\kappa$ that are bigger than 0, yet smaller than 50. Specifically, the posterior modes of $\kappa$ for the BBSB models with $(\alpha, \theta)$ fixed to $(1, 1.3)$, $(1, 0.3)$ and $(0.3, 2)$ are 12, 12 and 30, respectively. That is to say, in every case the model estimates that the corresponding l.v.'s are dependent. In fact, if we insert the parameters $\alpha = 1, 1, 0.3$, $\theta = 1.3, 0.3, 2$, and the posterior modes $\kappa = 12, 12, 30$, into Proposition 2.1 (d), we estimate the correlation coefficients of consecutive l.v.'s by 0.8392, 0.9023 and 0.9288, respectively. Notice that although the posterior modes of $\kappa$ are not large, these choices greatly affect the dependence of the l.v.'s in question. In particular, for the couple of BBSB mixtures with $\alpha = 1$, this suggests the Dirichlet mixture is not the best choice. Among these two, for the one with $\theta = 1.3$, we chose this parameter so that for the Dirichlet prior $E[K_n] \approx 7$, which coincides with the number of actual modes featured in database 3. Even in this case, the posterior distribution of $\kappa$ suggests that other BBSB models fit better than the Dirichlet mixture. As to the Pitman-Yor mixture, for which $\theta$ was also chosen as above, we see in Figure 10 (D) that the posterior distribution of $\sigma$ favours values close to 0, meaning that this model suggests that, among the possibilities, the Dirichlet process is the best fit. However, if we concentrate on Figure 9 we see that the densities estimated by all three BBSB mixtures adjust well to the data and recover the seven modes featured in the data set, whilst the Pitman-Yor model confuses the couple of modes on the left hand side of the figure.
This suggests that the class of BBSB mixtures offers a bigger capacity to adjust to the data by tuning the parameter $\kappa$ than the class of Pitman-Yor mixtures has by tuning the parameter $\sigma$.

Figure 9: Estimated densities for database 3, taking into account 5000 iterations of the Gibbs sampler after a burn-in period of 3000, for three distinct BBSB mixtures with parameters $(\alpha, \theta)$ fixed to $(1, 1.3)$, $(1, 0.3)$ and $(0.3, 2)$, and a Pitman-Yor mixture with parameter $\theta = 1.3$.

Figure 10: Posterior distributions of $\kappa$ (A, B and C) for the BBSB mixtures with parameters $(\alpha, \theta)$ fixed to $(1, 1.3)$, $(1, 0.3)$ and $(0.3, 2)$, respectively. D illustrates the posterior distribution of $\sigma$ for the Pitman-Yor mixture with $\theta = 1.3$. The dotted and dashed lines indicate the posterior means and modes, respectively.

6 Discussion

By using Beta chains as the l.v.'s of stick-breaking sequences, we were able to construct a new family of distributions over the infinite dimensional simplex, hence a new class of species sampling priors. The parameter, $\kappa$, that modulates the dependence among the elements of the Beta chain, also modulates the ordering of the corresponding weights. While the choice $\kappa = 0$ and $\alpha = 1$ recovers the size-biased permutation of the weights of Dirichlet processes, as $\kappa \to \infty$ we recover the decreasingly ordered weights of Geometric processes, both classes of processes being models of interest. This approach to defining priors also allows the construction of random measures that are hybrids between Dirichlet and Geometric processes. Furthermore, how similar the BBSB prior is to one model or the other can also be tuned by the parameter $\kappa$. As to the prior distribution of $K_n$, generally speaking, we found that a larger value of $\kappa$ translates to a less informative prior. This in turn allows more flexible models in a density estimation context. In general the class of BBSB mixtures offers models with a great capacity to adapt to distinct data sets and models with an efficient component mixing. By endowing the parameter $\kappa$ with a prior distribution, one can estimate its optimal value for a given data set, and thus choose the BBSB mixture that admits the optimal balance between flexibility and efficient mixing.

The present work gives rise to interesting questions, such as how to characterize the distribution of $K_n$ for BBSB priors and analyse its asymptotic behaviour as $n \to \infty$, or even further study the underlying exchangeable partition probability functions. As to the orderings of the weights, it is also of interest to compute or approximate $P[w_j > w_{j+1}]$ for a fixed value of $\kappa$, and to determine the rate at which $P[w_j > w_{j+1}] \to 1$ as $\kappa \to \infty$. In a non-exchangeable context (e.g. Leisen and Griffin; 2017; De Iorio et al.; 2004), one could also use the Beta-Binomial transition to model dependence between two or more species sampling processes whose weights enjoy the stick-breaking decomposition. Hopefully, the present paper motivates the study of stick-breaking sequences featuring dependent l.v.'s, which might even lead to other types of priors.

7 Acknowledgements

The first author gratefully thanks the support of CONACyT PhD scholarship program and CONACyT project 241195. The second author gratefully acknowledges the support of CONTEX project 2018-9B as well as the hospitality of the University of Bath, where part of the project was done, during a Global Professor research visit.

Appendix A.

Appendix A.1. Convergence of probability measures

To formally give the proof of the main results, we recall some topological details of measure spaces. For a Polish space $S$, with Borel $\sigma$-algebra $\mathcal{B}(S)$, we denote by $\mathcal{P}(S)$ the space of all probability measures over $(S, \mathcal{B}(S))$. A well-known metric on $\mathcal{P}(S)$ is the Lévy-Prokhorov metric given by
\[ d_{\mathcal{P}}(P, P') = \inf\{\varepsilon > 0 : P(A) \leq P'(A^{\varepsilon}) + \varepsilon, \; P'(A) \leq P(A^{\varepsilon}) + \varepsilon, \; \forall A \in \mathcal{B}(S)\}, \tag{17} \]
for any $P, P' \in \mathcal{P}(S)$, and where $A^{\varepsilon} = \{s \in S : d(s, A) < \varepsilon\}$, $d(s, A) = \inf\{d(a, s) : a \in A\}$ and $d$ is some complete metric on $S$. For probability measures $P, P_1, P_2, \ldots$ it is said that $P_n$ converges weakly to $P$, denoted by $P_n \overset{w}{\to} P$, whenever $\int f \, dP_n \to \int f \, dP$ for every continuous bounded function $f : S \to [0, \infty)$.
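This condition is known to be equivalent to $d_L(P_n, P) \to 0$, and to $\gamma_n \overset{d}{\to} \gamma$, whenever $\gamma_n \sim P_n$ and $\gamma \sim P$. $\mathcal{P}(S)$, equipped with the topology of weak convergence, is Polish again. Its Borel $\sigma$-field, $\mathcal{B}(\mathcal{P}(S))$, can equivalently be defined as the $\sigma$-algebra generated by all the projection maps $\{P \mapsto P(B) : B \in \mathcal{B}(S)\}$. In this sense the random probability measures (measurable mappings from a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ into $(\mathcal{P}(S), \mathcal{B}(\mathcal{P}(S)))$), $\mu, \mu_1, \mu_2, \ldots$, are said to converge weakly, a.s., whenever $\mu_n(\omega) \overset{w}{\to} \mu(\omega)$ for every $\omega$ outside a $\mathbb{P}$-null set. Analogously, if $\int_S f\,d\mu_n \overset{d}{\to} \int_S f\,d\mu$ for every continuous bounded function $f : S \to [0, \infty)$, it is said that $\mu_n$ converges weakly in distribution to $\mu$, denoted by $\mu_n \overset{wd}{\to} \mu$. Evidently, $\mu_n \overset{w}{\to} \mu$ a.s. implies $\mu_n \overset{wd}{\to} \mu$, which, in turn, is a necessary and sufficient condition for $\mu_n \overset{d}{\to} \mu$. For further details see for instance Parthasarathy (1967), Billingsley (1968) or Kallenberg (2017).

Appendix A.2. Proof of Proposition 2.1

a) Using elementary properties of conditional expectation and the fact that, given $x_i$, $v_{i+1}$ is conditionally independent of $v_i$, we obtain

$$\mathbb{E}[v_{i+1} \mid v_i] = \mathbb{E}\big[\mathbb{E}[v_{i+1} \mid x_i] \,\big|\, v_i\big] = \mathbb{E}\left[\frac{\alpha + x_i}{\alpha + \theta + \kappa} \,\middle|\, v_i\right] = \frac{\alpha + \kappa v_i}{\alpha + \theta + \kappa}.$$

b) Notice that $\mathrm{Var}(v_{i+1} \mid v_i) = \mathbb{E}[\mathrm{Var}(v_{i+1} \mid x_i) \mid v_i] + \mathrm{Var}(\mathbb{E}[v_{i+1} \mid x_i] \mid v_i)$, with

$$\mathrm{Var}(\mathbb{E}[v_{i+1} \mid x_i] \mid v_i) = \mathrm{Var}\left(\frac{\alpha + x_i}{\alpha + \theta + \kappa} \,\middle|\, v_i\right) = \frac{\kappa v_i (1 - v_i)}{(\alpha + \theta + \kappa)^2}.$$

Now, note that

$$\begin{aligned}
\mathbb{E}[(\alpha + x_i)(\theta + \kappa - x_i) \mid v_i] &= \mathrm{Cov}(\alpha + x_i, \theta + \kappa - x_i \mid v_i) + \mathbb{E}[\alpha + x_i \mid v_i]\,\mathbb{E}[\theta + \kappa - x_i \mid v_i] \\
&= -\mathrm{Var}(x_i \mid v_i) + (\alpha + \kappa v_i)(\theta + \kappa - \kappa v_i) \\
&= -\kappa v_i (1 - v_i) + (\alpha + \kappa v_i)(\theta + \kappa(1 - v_i)).
\end{aligned}$$

Hence

$$\begin{aligned}
\mathbb{E}[\mathrm{Var}(v_{i+1} \mid x_i) \mid v_i] &= \mathbb{E}\left[\frac{(\alpha + x_i)(\theta + \kappa - x_i)}{(\alpha + \theta + \kappa)^2 (\alpha + \theta + \kappa + 1)} \,\middle|\, v_i\right] \\
&= \frac{-\kappa v_i (1 - v_i) + (\alpha + \kappa v_i)(\theta + \kappa(1 - v_i))}{(\alpha + \theta + \kappa)^2 (\alpha + \theta + \kappa + 1)},
\end{aligned}$$

and we can conclude the proof of b),

$$\mathrm{Var}(v_{i+1} \mid v_i) = \frac{(\alpha + \kappa v_i)(\theta + \kappa(1 - v_i)) + \kappa v_i (1 - v_i)(\alpha + \theta + \kappa)}{(\alpha + \theta + \kappa)^2 (\alpha + \theta + \kappa + 1)}.$$

c) We first note that, as a consequence of the joint reversibility of the Beta-Binomial chain, $v_i \sim \mathsf{Be}(\alpha + x_i, \theta + \kappa - x_i)$ conditionally given $x_i$, thus

$$\mathbb{E}[v_i v_{i+1}] = \mathbb{E}\big[\mathbb{E}[v_i v_{i+1} \mid x_i]\big] = \mathbb{E}\big[\mathbb{E}[v_i \mid x_i]\,\mathbb{E}[v_{i+1} \mid x_i]\big] = \mathbb{E}\left[\left(\frac{\alpha + x_i}{\alpha + \theta + \kappa}\right)^2\right];$$

conditioning on $v_i$, we obtain

$$\begin{aligned}
\mathbb{E}\left[\left(\frac{\alpha + x_i}{\alpha + \theta + \kappa}\right)^2\right] &= \mathbb{E}\left[\mathbb{E}\left[\left(\frac{\alpha + x_i}{\alpha + \theta + \kappa}\right)^2 \,\middle|\, v_i\right]\right] \\
&= \mathbb{E}\left[\frac{\alpha^2 + 2\alpha\,\mathbb{E}[x_i \mid v_i] + \mathbb{E}[x_i^2 \mid v_i]}{(\alpha + \theta + \kappa)^2}\right] \\
&= \frac{\alpha^2 + 2\alpha\kappa\,\mathbb{E}[v_i] + \kappa\,\mathbb{E}[v_i] + \kappa(\kappa - 1)\,\mathbb{E}[v_i^2]}{(\alpha + \theta + \kappa)^2} \\
&= \left(\alpha^2 + \frac{\kappa\alpha(2\alpha + 1)}{\alpha + \theta} + \frac{\kappa(\kappa - 1)\alpha(\alpha + 1)}{(\alpha + \theta)(\alpha + \theta + 1)}\right)(\alpha + \theta + \kappa)^{-2},
\end{aligned}$$

hence

$$\mathrm{Cov}(v_i, v_{i+1}) = \mathbb{E}[v_i v_{i+1}] - \mathbb{E}[v_i]\,\mathbb{E}[v_{i+1}] = \frac{\kappa\alpha\theta}{(\alpha + \theta)^2 (\alpha + \theta + 1)(\alpha + \theta + \kappa)}.$$

d) The correlation simplifies as follows

$$\rho_{v_i, v_{i+1}} = \frac{\mathrm{Cov}(v_i, v_{i+1})}{\sqrt{\mathrm{Var}(v_i)}\sqrt{\mathrm{Var}(v_{i+1})}} = \frac{\kappa}{\alpha + \theta + \kappa}.$$

Appendix A.3. Proof of Proposition 2.2

To prove Proposition 2.2 we need some preliminary results.

Lemma A.1 (Continuous mappings). Let $S$ and $T$ be Polish spaces. Let $\eta, \eta_1, \eta_2, \ldots$ be random elements taking values in $S$, with $\eta_n \overset{d}{\to} \eta$, and consider some measurable mappings $f, f_1, f_2, \ldots$ from $S$ into $T$ satisfying $f_n(s_n) \to f(s)$ for every $s_n \to s$ in $S$. Then $f_n(\eta_n) \overset{d}{\to} f(\eta)$.

Lemma A.2. Let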
$\gamma = (\gamma_1, \gamma_2, \ldots)$ and $\gamma^{(n)} = (\gamma^{(n)}_1, \gamma^{(n)}_2, \ldots)$, $n \ge 1$, be random sequences taking values in a Polish space $S$. Then $\gamma^{(n)} \overset{d}{\to} \gamma$ if and only if

$$\big(\gamma^{(n)}_1, \ldots, \gamma^{(n)}_i\big) \overset{d}{\to} (\gamma_1, \ldots, \gamma_i), \quad \text{for every } i \ge 1.$$

Lemmas A.1 and A.2 are well-known results in probability theory, see for instance Theorems 4.27 and 4.29, respectively, in Kallenberg (2002).

Lemma A.3. Let $S$ and $T$ be Polish spaces. Consider some random elements
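$\gamma, \gamma_1, \gamma_2, \ldots$ and $\eta, \eta_1, \eta_2, \ldots$ taking values in $S$ and $T$, respectively. Let $\pi$ be the distribution of $\gamma$ and $\pi_n$ the distribution of $\gamma_n$; also consider some regular versions, $\nu_n(\cdot \mid \gamma_n)$ and $\nu(\cdot \mid \gamma)$, of $\mathbb{P}[\eta_n \in \cdot \mid \gamma_n]$ and $\mathbb{P}[\eta \in \cdot \mid \gamma]$, respectively. If $\gamma_n \overset{d}{\to} \gamma$ and for every $s_n \to s$ in $S$ we have that $\nu_n(\cdot \mid s_n) \overset{w}{\to} \nu(\cdot \mid s)$, then $(\gamma_n, \eta_n) \overset{d}{\to} (\gamma, \eta)$.

Proof: Let $g : S \times T \to \mathbb{R}$ be a continuous and bounded function. Define $f, f_1, f_2, \ldots : S \to \mathbb{R}$ by

$$f_n(s) = \int g(s, t)\,\nu_n(dt \mid s) \quad \text{and} \quad f(s) = \int g(s, t)\,\nu(dt \mid s).$$

The first thing we will prove is that

$$f_n(s_n) \to f(s) \quad \text{as } s_n \to s. \tag{18}$$

So let $s_n \to s$. Choose some random elements $\zeta, \zeta_1, \zeta_2, \ldots$ with $\zeta_n \sim \nu_n(\cdot \mid s_n)$ and $\zeta \sim \nu(\cdot \mid s)$; this way, $\zeta_n \overset{d}{\to} \zeta$ by hypothesis. Define $h, h_1, h_2, \ldots : T \to \mathbb{R}$ by $h_n(t) = g(s_n, t)$ and $h(t) = g(s, t)$. As $g$ is continuous, we have that $h_n(t_n) = g(s_n, t_n) \to g(s, t) = h(t)$ for every $t_n \to t$ in $T$. By Lemma A.1 we obtain $h_n(\zeta_n) \overset{d}{\to} h(\zeta)$, which in turn implies, since $g$ is bounded,

$$\int g(s_n, t)\,\nu_n(dt \mid s_n) = \mathbb{E}[g(s_n, \zeta_n)] = \mathbb{E}[h_n(\zeta_n)] \to \mathbb{E}[h(\zeta)] = \mathbb{E}[g(s, \zeta)] = \int g(s, t)\,\nu(dt \mid s).$$

Since $s_n \to s$ was arbitrary, this proves equation (18), which together with the hypothesis $\gamma_n \overset{d}{\to} \gamma$ and Lemma A.1 shows that $f_n(\gamma_n) \overset{d}{\to} f(\gamma)$. Particularly,

$$\int\!\!\int g(s, t)\,\nu_n(dt \mid s)\,\pi_n(ds) = \mathbb{E}[f_n(\gamma_n)] \to \mathbb{E}[f(\gamma)] = \int\!\!\int g(s, t)\,\nu(dt \mid s)\,\pi(ds). \tag{19}$$

Note that the double integral on the left side of equation (19) coincides with $\mathbb{E}[g(\gamma_n, \eta_n)]$, whilst the one on the right side coincides with $\mathbb{E}[g(\gamma, \eta)]$. That is, we have proven that $\mathbb{E}[g(\gamma_n, \eta_n)] \to \mathbb{E}[g(\gamma, \eta)]$ for every continuous and bounded function $g : S \times T \to \mathbb{R}$, or equivalently $(\gamma_n, \eta_n) \overset{d}{\to} (\gamma, \eta)$.

Lemma A.4. Let $(x_n)_{n \ge 1}$ be a sequence of random variables such that $x_n \sim \mathsf{Bin}(n, p_n)$ for every $n \ge 1$, where $p_n \to p$ in $[0, 1]$. Then

$$\frac{x_n}{n} \overset{\mathcal{L}_2}{\to} p.$$

Proof: For $n \ge 1$,

$$\mathbb{E}\left[\left(\frac{x_n}{n} - p\right)^2\right] = \frac{1}{n^2}\,\mathbb{E}[x_n^2] - \frac{2p}{n}\,\mathbb{E}[x_n] + p^2 = \frac{p_n(1 - p_n)}{n} + (p_n - p)^2. \tag{20}$$

By taking limits as $n \to \infty$ in (20) we obtain

$$\lim_{n \to \infty} \mathbb{E}\left[\left(\frac{x_n}{n} - p\right)^2\right] = 0.$$

Proof of Proposition 2.2:

(i) Insomuch as the corresponding spaces are Borel, we may construct on some probability space $(\hat\Omega, \hat{\mathcal{F}}, \hat{\mathbb{P}})$ a Beta-Binomial chain $(\hat V, \hat X)$ with parameters $(0, \alpha, \theta)$. Now, the elements of $\hat V$ are conditionally independent given $\hat X$, and given that $\kappa = 0$, $\hat X \overset{a.s.}{=} (0, 0, \ldots)$, so we may think of $\hat X$ as if it were deterministic, which implies that the elements of $\hat V$ must be independent and $\mathsf{Be}(\alpha, \theta)$ distributed.

(ii) For every $\kappa \ge 1$, let $V^{(\kappa)} = \big(v^{(\kappa)}_i\big)_{i \ge 1}$ be a Beta chain with parameters $(\kappa, \alpha, \theta)$, and let $\nu^{(\kappa)}\big(\cdot \mid v^{(\kappa)}_i\big)$ be some regular version of $\mathbb{P}\big[v^{(\kappa)}_{i+1} \in \cdot \mid v^{(\kappa)}_i\big]$ (which clearly does not depend on $i$). Further let $\lambda \sim \mathsf{Be}(\alpha, \theta)$ and fix $\nu(\cdot \mid \lambda) = \delta_\lambda$. The first thing we are interested in proving is that for every $p_\kappa \to p$ in $[0, 1]$ we have that

$$\nu^{(\kappa)}(\cdot \mid p_\kappa) \overset{w}{\to} \nu(\cdot \mid p). \tag{21}$$

So, let $p_\kappa \to p$ in $[0, 1]$. By Lemma A.4 and given that all the corresponding spaces are Borel, we may construct on a probability space $(\hat\Omega, \hat{\mathcal{F}}, \hat{\mathbb{P}})$, with expectations $\hat{\mathbb{E}}[\cdot]$, some pairs of r.v.'s $(\hat x_\kappa, \hat v_\kappa)$ such that $\hat x_\kappa \sim \mathsf{Bin}(\kappa, p_\kappa)$, $\{\hat v_\kappa \mid \hat x_\kappa\} \sim \mathsf{Be}(\alpha + \hat x_\kappa, \theta + \kappa - \hat x_\kappa)$, and $\hat x_\kappa / \kappa \overset{a.s.}{\to} p$. Note that marginally $\hat v_\kappa \sim \nu^{(\kappa)}(\cdot \mid p_\kappa)$, so to prove equation (21) it suffices to show $\hat v_\kappa \overset{d}{\to} p$. Conditionally given $\hat x_\kappa$, the moment generating function of $\hat v_\kappa$ is

$$\hat{\mathbb{E}}\left[e^{t \hat v_\kappa} \,\middle|\, \hat x_\kappa\right] = 1 + \sum_{k=1}^{\infty} \left(\prod_{r=0}^{k-1} \frac{\alpha + \hat x_\kappa + r}{\alpha + \theta + \kappa + r}\right) \frac{t^k}{k!}, \quad t \in \mathbb{R}. \tag{22}$$

By construction we have that $\hat x_\kappa / \kappa \overset{a.s.}{\to} p$, which means that for every $r \ge 0$,

$$\frac{\alpha + \hat x_\kappa + r}{\alpha + \theta + \kappa + r} = \left(\frac{\alpha + r}{\kappa} + \frac{\hat x_\kappa}{\kappa}\right)\left(\frac{\alpha + \theta + r}{\kappa} + 1\right)^{-1} \overset{a.s.}{\to} p, \tag{23}$$

as $\kappa \to \infty$. Hence by the tower property of conditional expectation, equations (22) and (23), and the Lebesgue dominated convergence theorem (the corresponding functions are dominated by $e^{|t|}$) we obtain

$$\begin{aligned}
\lim_{\kappa \to \infty} \hat{\mathbb{E}}\left[e^{t \hat v_\kappa}\right] &= \lim_{\kappa \to \infty} \hat{\mathbb{E}}\left[\hat{\mathbb{E}}\left[e^{t \hat v_\kappa} \,\middle|\, \hat x_\kappa\right]\right] \\
&= \hat{\mathbb{E}}\left[1 + \sum_{k=1}^{\infty} \lim_{\kappa \to \infty} \left(\prod_{r=0}^{k-1} \frac{\alpha + \hat x_\kappa + r}{\alpha + \theta + \kappa + r}\right) \frac{t^k}{k!}\right] \\
&= \hat{\mathbb{E}}\left[1 + \sum_{k=1}^{\infty} \frac{(pt)^k}{k!}\right] \\
&= e^{tp},
\end{aligned}$$

which proves altogether $\hat v_\kappa \overset{d}{\to} p$ and equation (21).

Returning to the original Beta chains, we have that $v^{(\kappa)}_1 \overset{d}{=} \lambda$ for every $\kappa \ge 1$, so trivially $v^{(\kappa)}_1 \overset{d}{\to} \lambda$; this together with equation (21) and the recursive application of Lemma A.3 allows us to obtain

$$\big(v^{(\kappa)}_1, \ldots, v^{(\kappa)}_i\big) \overset{d}{\to} (\lambda, \ldots, \lambda), \quad i \ge 1,$$

and by Lemma A.2 we can conclude $V^{(\kappa)} = \big(v^{(\kappa)}_i\big)_{i \ge 1} \overset{d}{\to} (\lambda, \lambda, \ldots)$.

Appendix A.4. Proof of Proposition 3.1

For sequences that enjoy the decomposition (3) we may equivalently prove that

$$1 - \sum_{i=1}^{j} w_i = \prod_{i=1}^{j} (1 - v_i) \overset{a.s.}{\to} 0,$$

as $j \to \infty$ (see for instance Ghosal and van der Vaart; 2017). Further, these r.v.'s are non-negative, non-increasing in $j$ and bounded by 1, thus it is enough to show that

$$\lim_{j \to \infty} \mathbb{E}\left[\prod_{i=1}^{j} (1 - v_i)\right] = 0. \tag{24}$$

As the corresponding spaces are Borel (after possibly enlarging the original probability space), it is possible to construct a Binomial chain $X$ such that $(V, X)$ defines a Beta-Binomial chain. Conditionally given $X = \{x_i\}_{i \ge 1}$, the elements of $V = \{v_i\}_{i \ge 1}$ are independent with $\{v_1 \mid x_1\} \sim \mathsf{Be}(\alpha + x_1, \theta + \kappa - x_1)$ and $\{v_{i+1} \mid x_i, x_{i+1}\} \sim \mathsf{Be}(\alpha + x_i + x_{i+1}, \theta + 2\kappa - x_i - x_{i+1})$, for $i \ge 1$. Hence

$$\begin{aligned}
\mathbb{E}\left[\prod_{i=1}^{j} (1 - v_i)\right] &= \mathbb{E}\left[\mathbb{E}\left[\prod_{i=1}^{j} (1 - v_i) \,\middle|\, X\right]\right] \\
&= \mathbb{E}\left[\mathbb{E}[(1 - v_1) \mid x_1] \prod_{i=2}^{j} \mathbb{E}[(1 - v_i) \mid x_{i-1}, x_i]\right] \\
&= \mathbb{E}\left[\frac{\theta + \kappa - x_1}{\alpha + \theta + \kappa} \prod_{i=2}^{j} \frac{\theta + 2\kappa - x_{i-1} - x_i}{\alpha + \theta + 2\kappa}\right].
\end{aligned}$$

Recalling that $0 \le x_i$ a.s. we obtain

$$\mathbb{E}\left[\prod_{i=1}^{j} (1 - v_i)\right] \le \frac{\theta + \kappa}{\alpha + \theta + \kappa}\left(\frac{\theta + 2\kappa}{\alpha + \theta + 2\kappa}\right)^{j-1} \le \left(\frac{\theta + 2\kappa}{\alpha + \theta + 2\kappa}\right)^{j-1},$$

for every $j \ge 1$. Finally, by taking limits as $j \to \infty$ in the last equation, (24) follows.

Appendix A.5. Proof of Theorem 3.2

To prove Theorem 3.2 we will first prove a couple of elementary results.

Lemma A.5. Let $S$ be a Polish space and fix some distinct $s_1, s_2, \ldots \in S$, let $p = (p_1, p_2, \ldots)$ and $q = (q_1, q_2, \ldots)$ be elements of $\Delta_\infty$ and define $P = \sum_{j \ge 1} p_j \delta_{s_j}$ and $Q = \sum_{j \ge 1} q_j \delta_{s_j}$.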
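Then for $d_L$ as in equation (17),

$$d_L(P, Q) \le \sum_{j \ge 1} |p_j - q_j|.$$

Proof: Define $\varepsilon(p, q) = \sum_{j \ge 1} |p_j - q_j|$. By definition of $d_L$, it suffices to prove that for all $A \in \mathcal{B}(S)$

$$P(A) \le Q\big(A^{\varepsilon(p,q)}\big) + \varepsilon(p, q), \quad \text{and} \quad Q(A) \le P\big(A^{\varepsilon(p,q)}\big) + \varepsilon(p, q). \tag{25}$$

So let $A \in \mathcal{B}(S)$ and set $M_A = \{j \ge 1 : s_j \in A\}$, then

$$\begin{aligned}
P(A) = \sum_{j \in M_A} P(\{s_j\}) = \sum_{j \in M_A} p_j &\le \sum_{j \in M_A} q_j + \sum_{j \in M_A} |p_j - q_j| \\
&\le Q(A) + \varepsilon(p, q) \\
&\le Q\big(A^{\varepsilon(p,q)}\big) + \varepsilon(p, q).
\end{aligned}$$

Analogously, we have that $Q(A) \le P\big(A^{\varepsilon(p,q)}\big) + \varepsilon(p, q)$.

Lemma A.6. For fixed and distinct elements $s_1, s_2, \ldots \in S$, the mapping

$$(w_1, w_2, \ldots) \mapsto \sum_{j \ge 1} w_j \delta_{s_j},$$

from $\Delta_\infty$ into $\mathcal{P}(S)$, is continuous with respect to the weak topology.

Proof: Let $w^{(n)} = \big(w^{(n)}_1, w^{(n)}_2, \ldots\big)$ and $w = (w_1, w_2, \ldots)$ be any elements of $\Delta_\infty$ such that $w^{(n)}_j \to w_j$ for every $j \ge 1$. Define $P^{(n)} = \sum_{j \ge 1} w^{(n)}_j \delta_{s_j}$ and $P = \sum_{j \ge 1} w_j \delta_{s_j}$. By Lemma A.5,

$$d_L\big(P^{(n)}, P\big) \le \sum_{j \ge 1} \big|w^{(n)}_j - w_j\big| \le \sum_{j \ge 1} w^{(n)}_j + \sum_{j \ge 1} w_j = 2,$$

and by the general Lebesgue dominated convergence theorem we obtain

$$\lim_{n \to \infty} d_L\big(P^{(n)}, P\big) \le \lim_{n \to \infty} \sum_{j \ge 1} \big|w^{(n)}_j - w_j\big| = \sum_{j \ge 1} \lim_{n \to \infty} \big|w^{(n)}_j - w_j\big| = 0,$$

which means that the mapping $(w_1, w_2, \ldots) \mapsto \sum_{j \ge 1} w_j \delta_{s_j}$ is continuous.

Remark A.7. Regardless of the choice of the metric, $\varrho$, on $\Delta_\infty$, as long as $\varrho$ generates the Borel $\sigma$-algebra, $\varrho\big(w^{(n)}, w\big) \to 0$ implies $\big|w^{(n)}_j - w_j\big| \to 0$ for every $j \ge 1$. For this reason, in the above proof we did not discuss the details of the metric on $\Delta_\infty$ that is being used.

Proof of Theorem 3.2: The proof of (i) follows directly from Proposition 2.2 (i). To prove (ii), note that by Proposition 2.2 (ii) and given that all the corresponding spaces are Borel, we can construct on a probability space $(\hat\Omega, \hat{\mathcal{F}}, \hat{\mathbb{P}})$ Beta chains $\hat V^{(\kappa)} = \big(\hat v^{(\kappa)}_i\big)_{i \ge 1}$ with parameters $(\kappa, \alpha, \theta)$ and a $\hat\lambda \sim \mathsf{Be}(\alpha, \theta)$ such that $\hat v^{(\kappa)}_i \overset{a.s.}{\to} \hat\lambda$, as $\kappa \to \infty$, for every $i \ge 1$. Define also an independent sequence, $\hat\Xi = \big(\hat\xi_j\big)_{j \ge 1}$, with $\hat\xi_j \overset{iid}{\sim} P_0$.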
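Now, for $\kappa \ge 1$ set

$$\hat w^{(\kappa)}_j = \hat v^{(\kappa)}_j \prod_{i=1}^{j-1} \big(1 - \hat v^{(\kappa)}_i\big), \quad j \ge 1, \quad \text{and} \quad \hat\mu^{(\kappa)} = \sum_{j \ge 1} \hat w^{(\kappa)}_j \delta_{\hat\xi_j},$$

with the empty product equating to 1; also set $\hat\mu = \sum_{j \ge 1} \hat\lambda \big(1 - \hat\lambda\big)^{j-1} \delta_{\hat\xi_j}$, so

$$\hat\mu^{(\kappa)} \overset{d}{=} \mu^{(\kappa)}, \quad \kappa \ge 1, \quad \text{and} \quad \hat\mu \overset{d}{=} \mu. \tag{26}$$

As the mapping $\big(\hat v^{(\kappa)}_1, \ldots, \hat v^{(\kappa)}_j\big) \mapsto \hat w^{(\kappa)}_j$ is continuous, we have that

$$\hat w^{(\kappa)}_j \overset{a.s.}{\to} \hat\lambda \big(1 - \hat\lambda\big)^{j-1}, \quad j \ge 1.$$

For the sequence $\hat\Xi$, the diffuseness of $P_0$ implies that for $i \ne j$, $\hat\xi_i \ne \hat\xi_j$ a.s. Since we are dealing with a countable number of random variables, there exists some $B \in \hat{\mathcal{F}}$ such that $\hat{\mathbb{P}}[B] = 1$ and for every $\omega \in B$

$$\hat w^{(\kappa)}_j(\omega) \to \hat\lambda(\omega)\big(1 - \hat\lambda(\omega)\big)^{j-1}, \quad j \ge 1, \quad \text{and} \quad \hat\xi_i(\omega) \ne \hat\xi_j(\omega), \quad i \ne j.$$

By Lemma A.6,

$$\sum_{j \ge 1} \hat w^{(\kappa)}_j(\omega)\,\delta_{\hat\xi_j(\omega)} \overset{w}{\to} \sum_{j \ge 1} \hat\lambda(\omega)\big(1 - \hat\lambda(\omega)\big)^{j-1} \delta_{\hat\xi_j(\omega)}, \quad \omega \in B,$$

that is, $\hat\mu^{(\kappa)} \overset{w}{\to} \hat\mu$ a.s., implying $\hat\mu^{(\kappa)} \overset{d}{\to} \hat\mu$. Finally, by equation (26), the result follows.

Appendix A.6. Proof of Corollary 3.3

The proof of (i) can be found in Theorem 1 by Pitman (1996a). To prove (ii) note that we may write

$$w^{(\kappa)}_1 = v^{(\kappa)}_1, \qquad w^{(\kappa)}_{j+1} = \frac{v^{(\kappa)}_{j+1}\big(1 - v^{(\kappa)}_j\big)}{v^{(\kappa)}_j}\, w^{(\kappa)}_j, \quad j \ge 1,$$

hence

$$\mathbb{P}\big[w^{(\kappa)}_{j+1} < w^{(\kappa)}_j\big] = \mathbb{P}\big[v^{(\kappa)}_{j+1}\big(1 - v^{(\kappa)}_j\big) < v^{(\kappa)}_j\big].$$

By the second part of Proposition 2.2 and as the corresponding spaces are Borel, we may construct on some probability space $(\hat\Omega, \hat{\mathcal{F}}, \hat{\mathbb{P}})$, with expectations $\hat{\mathbb{E}}[\cdot]$, Beta chains $\big(\hat v^{(\kappa)}_i\big)_{i \ge 1}$ with parameters $(\kappa, \alpha, \theta)$, and a $\hat\lambda \sim \mathsf{Be}(\alpha, \theta)$ satisfying

$$\big(\hat v^{(\kappa)}_i\big)_{i \ge 1} \to \big(\hat\lambda, \hat\lambda, \ldots\big) \quad \text{a.s.}$$

Then for $j \ge 1$, there exists $A \in \hat{\mathcal{F}}$ with $\hat{\mathbb{P}}[A] = 1$ and such that for every $\omega \in A$, $\hat v^{(\kappa)}_j(\omega) \to \hat\lambda(\omega)$ and $\hat v^{(\kappa)}_{j+1}(\omega) \to \hat\lambda(\omega)$ (as $\hat\lambda > 0$ a.s., we may also take $\hat\lambda(\omega) > 0$ on $A$). Fix $\omega \in A$. Since $\hat\lambda(\omega)\big(1 - \hat\lambda(\omega)\big) < \hat\lambda(\omega)$, we may choose $\kappa'$ such that for every $\kappa > \kappa'$, $\hat v^{(\kappa)}_{j+1}(\omega)\big(1 - \hat v^{(\kappa)}_j(\omega)\big) < \hat v^{(\kappa)}_j(\omega)$. As $\omega$ was chosen arbitrarily in $A$ we have that $\mathbf{1}\big\{\hat v^{(\kappa)}_{j+1}\big(1 - \hat v^{(\kappa)}_j\big) < \hat v^{(\kappa)}_j\big\} \to 1$ a.s., as $\kappa \to \infty$. Finally, by the Lebesgue dominated convergence theorem we obtain

$$\begin{aligned}
\lim_{\kappa \to \infty} \mathbb{P}\big[w^{(\kappa)}_{j+1} < w^{(\kappa)}_j\big] &= \lim_{\kappa \to \infty} \mathbb{E}\Big[\mathbf{1}\big\{v^{(\kappa)}_{j+1}\big(1 - v^{(\kappa)}_j\big) < v^{(\kappa)}_j\big\}\Big] \\
&= \lim_{\kappa \to \infty} \hat{\mathbb{E}}\Big[\mathbf{1}\big\{\hat v^{(\kappa)}_{j+1}\big(1 - \hat v^{(\kappa)}_j\big) < \hat v^{(\kappa)}_j\big\}\Big] \\
&= \hat{\mathbb{E}}\Big[\lim_{\kappa \to \infty} \mathbf{1}\big\{\hat v^{(\kappa)}_{j+1}\big(1 - \hat v^{(\kappa)}_j\big) < \hat v^{(\kappa)}_j\big\}\Big] \\
&= 1.
\end{aligned}$$

References

Billingsley, P. (1968).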
Convergence of Probability Measures, Wiley Series in Probability and Statistics, John Wiley and Sons Inc.
Bissiri, P. and Ongaro, A. (2014). On the topological support of species sampling priors, Electronic Journal of Statistics 8(1): 861–882.
Blackwell, D. and MacQueen, J. (1973). Ferguson distributions via Pólya urn schemes, The Annals of Statistics 1: 353–355.
De Iorio, M. D., Müller, P., Rosner, G. L. and MacEachern, S. N. (2004). An ANOVA model for dependent random measures, Journal of the American Statistical Association 99: 205–215.
Favaro, S., Lijoi, A., Nava, C., Nipoti, B., Prünster, I. and Teh, Y. (2016). On the stick-breaking representation for homogeneous NRMIs, Bayesian Analysis 11: 697–724.
Favaro, S., Lijoi, A. and Prünster, I. (2012). On the stick-breaking representation of normalized inverse Gaussian priors, Biometrika 99: 663–674.
Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems, The Annals of Statistics 1(2): 209–230.
Fuentes-García, R., Mena, R. H. and Walker, S. G. (2009). A nonparametric dependent process for Bayesian regression, Statistics & Probability Letters 79(4): 1112–
Fuentes-García, R., Mena, R. H. and Walker, S. G. (2010). A new Bayesian nonparametric mixture model, Communications in Statistics - Simulation and Computation 39(4): 669–682.
Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
Hatjispyros, J., Merkatas, C., Nicoleris, T. and Walker, S. (2018). Dependent mixtures of geometric weights priors, Computational Statistics and Data Analysis 119: 1–18.
Hjort, N., Holmes, C., Müller, P. and Walker, S. G. (2010). Bayesian Nonparametrics, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors, Journal of the American Statistical Association 96(453): 161–173.
Ishwaran, H. and Zarepour, M. (2002). Exact and approximate sum representations for the Dirichlet process, Canadian Journal of Statistics 30(2): 269–283.
James, L. F., Lijoi, A. and Prünster, I. (2009). Posterior analysis for normalized random measures with independent increments, Scandinavian Journal of Statistics 36(1): 76–97.
Kallenberg, O. (2002). Foundations of Modern Probability, second edn, Springer, New York.
Kallenberg, O. (2017). Random Measures, Theory and Applications, Vol. 77, Springer.
Leisen, F. and Griffin, J. (2017). Compound random measures and their use in Bayesian non-parametrics, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(2): 525–545.
Lijoi, A., Mena, R. and Prünster, I. (2007). Controlling the reinforcement in Bayesian non-parametric mixture models, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(4): 715–740.
McCloskey, T. (1965). A model for the distribution of individuals by species in an environment, Technical report, Michigan State University Department of Statistics.
Mena, R., Ruggiero, M. and Walker, S. G. (2011). Geometric stick-breaking processes for continuous-time Bayesian nonparametric modelling, Journal of Statistical Planning and Inference 141: 3217–3230.
Mena, R. and Walker, S. G. (2009). On a construction of Markov models in continuous time, METRON - International Journal of Statistics LXVII: 303–323.
Mena, R. and Walker, S. G. (2015). On the Bayesian mixture model and identifiability, Journal of Computational and Graphical Statistics 24: 1155–1169.
Nieto-Barajas, L. E. and Walker, S. G. (2002). Markov Beta and Gamma processes for modelling hazard rates, Scandinavian Journal of Statistics 29(3): 413–424.
Parthasarathy, K. R. (1967). Probability Measures on Metric Spaces, Academic Press, New York.
Perman, M., Pitman, J. and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions, Probability Theory and Related Fields 92(1): 21–39.
Pitman, J. (1996a). Random discrete distributions invariant under size-biased permutation, Advances in Applied Probability 28(2): 525–539.
Pitman, J. (1996b). Some developments of the Blackwell-MacQueen urn scheme, in T. F. et al. (ed.), Statistics, Probability and Game Theory; Papers in honor of David Blackwell, Vol. 30 of Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Hayward, California, pp. 245–267.
Pitman, J. (2006). Combinatorial Stochastic Processes, Vol. 1875 of École d'été de probabilités de Saint-Flour, first edn, Springer-Verlag Berlin Heidelberg, New York.
Pitman, J. and Yor, M. (1992). Arcsine laws and interval partitions derived from a stable subordinator, Proceedings of the London Mathematical Society s3-65(2): 326–356.
Pitt, M., Chatfield, C. and Walker, S. G. (2002). Constructing first order stationary autoregressive models via latent processes, Scandinavian Journal of Statistics 29: 657–663.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors, Statistica Sinica 4: 639–650.
Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices, Communications in Statistics - Simulation and Computation 36(1): 45–54.
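
Supplementary numerical sketch. The lag-one correlation in Proposition 2.1 d) and the ordering behaviour in Corollary 3.3 (ii) can be checked by simulation. The following Python sketch is not part of the original paper. It assumes the Beta-Binomial transition used in the proofs of Appendix A, with the chain parameters written as `kappa` (taken to be a non-negative integer), `alpha` and `theta`; all function names are hypothetical. Only the standard library is used, and each Binomial variable is drawn as a sum of Bernoulli draws, which is adequate for moderate `kappa` but is not meant as an efficient sampler.

```python
import random
from statistics import fmean


def beta_binomial_chain(n, kappa, alpha, theta, rng):
    """Draw v_1, ..., v_n with v_1 ~ Be(alpha, theta),
    x_i | v_i ~ Bin(kappa, v_i) and
    v_{i+1} | x_i ~ Be(alpha + x_i, theta + kappa - x_i)."""
    v = [rng.betavariate(alpha, theta)]
    for _ in range(n - 1):
        # Binomial(kappa, v_i) draw as a sum of kappa Bernoulli(v_i) draws.
        x = sum(rng.random() < v[-1] for _ in range(kappa))
        v.append(rng.betavariate(alpha + x, theta + kappa - x))
    return v


def stick_breaking_weights(v):
    """Map length variables to weights w_j = v_j * prod_{i<j} (1 - v_i)."""
    weights, remaining = [], 1.0
    for vj in v:
        weights.append(vj * remaining)
        remaining *= 1.0 - vj
    return weights


def lag_one_correlation(kappa, alpha, theta, reps, rng):
    """Empirical correlation between v_1 and v_2 over `reps` independent chains."""
    pairs = [beta_binomial_chain(2, kappa, alpha, theta, rng) for _ in range(reps)]
    a = [p[0] for p in pairs]
    b = [p[1] for p in pairs]
    mean_a, mean_b = fmean(a), fmean(b)
    cov = fmean((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = fmean((x - mean_a) ** 2 for x in a)
    var_b = fmean((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def prob_decreasing(kappa, alpha, theta, reps, rng):
    """Monte Carlo estimate of P[w_1 > w_2] for the first two weights."""
    hits = 0
    for _ in range(reps):
        w = stick_breaking_weights(beta_binomial_chain(2, kappa, alpha, theta, rng))
        hits += w[0] > w[1]
    return hits / reps


if __name__ == "__main__":
    rng = random.Random(2019)
    alpha, theta = 1.0, 1.0
    for kappa in (0, 8, 50):
        rho_hat = lag_one_correlation(kappa, alpha, theta, 10000, rng)
        rho = kappa / (alpha + theta + kappa)  # value stated in Proposition 2.1 d)
        p_hat = prob_decreasing(kappa, alpha, theta, 10000, rng)
        print(f"kappa={kappa}: corr {rho_hat:.3f} (theory {rho:.3f}), "
              f"P[w_1 > w_2] approx {p_hat:.3f}")
```

With `alpha = theta = 1`, the empirical correlation should lie close to `kappa / (alpha + theta + kappa)`, and the estimate of $\mathbb{P}[w_1 > w_2]$ should grow with `kappa`. For `kappa = 0` the two length variables are independent uniforms, and a direct calculation gives $\mathbb{P}[w_1 > w_2] = \log 2 \approx 0.693$, which provides a simple sanity check on the sampler.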