Fisher-Rao geometry of Dirichlet distributions

Alice Le Brigant; Stephen Preston; Stéphane Puechmorel

doi:10.1016/j.difgeo.2020.101702

2020-05-12

Abstract. In this paper, we study the geometry induced by the Fisher-Rao metric on the parameter space of Dirichlet distributions. We show that this space is a Hadamard manifold, i.e. that it is geodesically complete and has everywhere negative sectional curvature. An important consequence for applications is that the Fréchet mean of a set of Dirichlet distributions is uniquely defined in this geometry.

1. Introduction

The differential geometric approach to probability theory and statistics has met increasing interest in the past years, from the theoretical point of view as well as in applications. In this approach, probability distributions are seen as elements of a differentiable manifold, on which a metric structure is defined through the choice of a Riemannian metric. Two very important ones are the Wasserstein metric, central in optimal transport, and the Fisher-Rao metric (also called Fisher information metric), essential in information geometry. Unlike optimal transport, information geometry is foremost concerned with parametric families of probability distributions, and defines a Riemannian structure on the parameter space using the Fisher information matrix [14]. It was Rao who showed in 1945 [26] that the Fisher information could be used to locally define a scalar product on the space of parameters, and interpreted as a Riemannian metric. Later on, Cencov [13] proved that it was the only metric invariant with respect to sufficient statistics, for families with finite sample spaces. This result has been extended more recently to non-parametric distributions with infinite support [8, 9]. Information geometry has been used to obtain new results in statistical inference as well as to gain insight into existing ones.
In parameter estimation for example, Amari [3] shows that conditions for consistency and efficiency of estimators can be expressed in terms of geometric conditions; in the presence of hidden variables, the famous Expectation-Maximisation (EM) algorithm can be described in an entirely geometric manner; and in order to ensure invariance to diffeomorphic change of parametrization, the so-called natural gradient [2] can be used to define accurate parameter estimation algorithms [21].

Another important use of information geometry is for the effective comparison and analysis of families of probability distributions. The geometric tools provided by the Riemannian framework, such as the geodesics, geodesic distance and intrinsic mean, have proved useful to interpolate, compare, average or perform segmentation between objects modeled by probability densities, in applications such as signal processing [5], image [29, 4] or shape analysis [24, 31], to name a few.

These applications rely on the specific study of the geometries of usual parametric families of distributions, which started with the early work of Atkinson and Mitchell. In [7], the authors study the trivial geometries of one-parameter families of distributions, the hyperbolic geometry of the univariate normal model as well as special cases of the multivariate normal model, a work that is continued by Skovgaard in [30]. The family of gamma distributions has been studied by Lauritzen in [19], and more recently by Arwini and Dodson in [6], who also focus on the log-normal, log-gamma, and families of bivariate distributions. Power inverse Gaussian distributions [35], location-scale models and in particular the von Mises distribution [28], and the generalized gamma distributions [27] have also received attention.

arXiv:2005.05608v2 [math.DG] 19 Nov 2020

In this work, we are interested in Dirichlet distributions, a family of probability densities defined on the $(n-1)$-dimensional probability simplex, that is the set of vectors of $\mathbb{R}^n$ with non-negative components that sum up to one. The Dirichlet distribution models a random probability distribution on a finite set of size $n$. It generalizes the beta distribution, a two-parameter probability measure on $[0, 1]$ used to model random variables defined on a compact interval. Beta and Dirichlet distributions are often used in Bayesian inference as conjugate priors for several discrete probability laws [23, 16, 11], but also come up in a wide variety of other applications, e.g. to model percentages and proportions in genomic studies [33], distribution of words in text documents [20], or for mixture models [10]. To our knowledge, the information geometry of Dirichlet distributions has not yet received much attention. In [12], the authors give the expression of the Fisher-Rao metric for the family of beta distributions, but nothing is said about the geodesics or the curvature.

In this paper, we give new results and properties for the geometry of Dirichlet distributions, and its sectional curvature in particular. The derived expressions depend on the trigamma function, the second derivative of the logarithm of the gamma function; however, we will avoid using its properties when possible to obtain our results. Instead, we consider a more general metric written using a function $f$, for which we only make the strictly necessary assumptions. Section 2 gives the setup for our problem by considering the Fisher-Rao metric on the space of parameters of Dirichlet distributions. In Section 3, we consider the more general metric where $f$ replaces the trigamma function, and show that it induces the geometry of a submanifold in a flat Lorentzian space. This allows us to show geodesic completeness, and that the sectional curvature is everywhere negative.
Section 4 focuses on the two-dimensional case, i.e. beta distributions.

2. Fisher-Rao metric on the manifold of Dirichlet distributions

Let $\Delta_n$ denote the $(n-1)$-dimensional probability simplex, i.e. the set of vectors in $\mathbb{R}^n$ with non-negative components that sum up to one
$$\Delta_n = \Big\{ q = (q_1, \dots, q_n) \in \mathbb{R}^n,\ \sum_{i=1}^n q_i = 1,\ q_i \geq 0,\ i = 1, \dots, n \Big\}.$$
The family of Dirichlet distributions is a family of probability distributions on $\Delta_n$ parametrized by $n$ positive scalars $x_1, \dots, x_n > 0$ (Figure 1), that admits the following probability density function with respect to the Lebesgue measure
$$f_n(q \mid x_1, \dots, x_n) = \frac{\Gamma(x_1 + \dots + x_n)}{\Gamma(x_1) \cdots \Gamma(x_n)}\, q_1^{x_1 - 1} \cdots q_n^{x_n - 1}.$$
As an open subset of $\mathbb{R}^n$, the space of parameters $M = (\mathbb{R}_+^*)^n$ is a differentiable manifold and can be equipped with a Riemannian metric defined in its matrix form by the Fisher information matrix
$$g_{ij}(x_1, \dots, x_n) = -E\left[ \frac{\partial^2}{\partial x_i \partial x_j} \log f_n(Q \mid x_1, \dots, x_n) \right], \quad i, j = 1, \dots, n,$$
where $E$ denotes the expectation taken with respect to $Q$, a random variable with density $f_n(\cdot \mid x_1, \dots, x_n)$.

Figure 1. Random samples drawn from Dirichlet distributions on the 2-dimensional simplex $\Delta_3$ for different values of the parameters $(x_1, x_2, x_3)$.

The Dirichlet distributions form an exponential family and so the Fisher-Rao metric is the Hessian of the log-partition function [3], namely
$$g_{ij}(x_1, \dots, x_n) = \frac{\partial^2}{\partial x_i \partial x_j} \varphi(x_1, \dots, x_n), \quad i, j = 1, \dots, n,$$
where $\varphi$ is the logarithm of the normalizing factor
$$\varphi(x_1, \dots, x_n) = \sum_{i=1}^n \log \Gamma(x_i) - \log \Gamma(x_1 + \dots + x_n).$$
We obtain the following metric tensor
$$g_{ij}(x_1, \dots, x_n) = \psi'(x_i)\,\delta_{ij} - \psi'(x_1 + \dots + x_n), \tag{1}$$
where $\delta_{ij}$ is the Kronecker delta function, and $\psi$ denotes the digamma function, that is the first derivative of the logarithm of the gamma function, i.e.
$$\psi(x) = \frac{d}{dx} \log \Gamma(x).$$
Its derivative $\psi'$ is called the trigamma function.
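
As an illustration of formula (1), the metric matrix can be evaluated numerically. The following is only a minimal sketch, not code from the paper: the helper names `trigamma` and `fisher_rao_metric` are hypothetical, and the trigamma function is approximated with the standard recurrence $\psi'(x) = \psi'(x+1) + 1/x^2$ followed by a truncated asymptotic series, an implementation choice made here to stay within the standard library.

```python
import math


def trigamma(x):
    """Approximate the trigamma function psi'(x) for x > 0."""
    if x <= 0:
        raise ValueError("trigamma is only evaluated here for x > 0")
    total = 0.0
    # The recurrence psi'(x) = psi'(x + 1) + 1/x^2 shifts x upward until
    # the asymptotic series below is accurate.
    while x < 10.0:
        total += 1.0 / (x * x)
        x += 1.0
    # Truncated asymptotic series:
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv * (1.0 + inv * (0.5 + inv * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))))
    return total + series


def fisher_rao_metric(x):
    """Matrix (g_ij) of equation (1) at the parameters x = (x_1, ..., x_n)."""
    shared = trigamma(sum(x))  # psi'(x_1 + ... + x_n), subtracted from every entry
    n = len(x)
    return [[(trigamma(x[i]) if i == j else 0.0) - shared for j in range(n)]
            for i in range(n)]


# Metric at the beta distribution with parameters (2, 5).
g = fisher_rao_metric([2.0, 5.0])
```

In this sketch the off-diagonal entries all equal $-\psi'(x_1 + \dots + x_n)$, so the matrix is a diagonal matrix minus a rank-one term, which is the structure exploited in the rest of the paper. If SciPy is available, `scipy.special.polygamma(1, x)` could replace the hand-written approximation.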
As noted below, the trigamma function is a function whose reciprocal is increasing, convex, and sublinear on $\mathbb{R}_+$. For slightly greater generality, and to emphasize what properties of this function are needed for our results, we will work in the sequel with a more general function $f$ on which we make only the necessary assumptions; in our special case we have $f = 1/\psi'$.

3. The general framework

3.1. The metric. In this section we consider a more general geometry, that admits the Fisher-Rao geometry of Dirichlet distributions as a special case. The goal is to avoid using the properties of the trigamma function when possible. For this, we consider the quadrant $M = (\mathbb{R}_+^*)^n$ equipped with a metric of the form
$$ds^2 = \frac{dx_1^2}{f(x_1)} + \dots + \frac{dx_n^2}{f(x_n)} - \frac{(dx_1 + \dots + dx_n)^2}{f(x_1 + \dots + x_n)}, \tag{2}$$
where $f : \mathbb{R}_+ \to \mathbb{R}_+$ is a function on which we make the following assumptions:
$$f(x) = O_{x \to 0}(x^2), \quad f'(x) = O_{x \to 0}(x), \quad f(x) = O_{x \to \infty}(x^2), \quad f'' > 0 \quad \text{and} \quad \frac{d^2}{dx^2}\Big(\frac{f}{f'}\Big) > 0. \tag{3}$$
We retrieve the Fisher-Rao metric (1) when
$$f(x) = \frac{1}{\psi'(x)}.$$
Notice that this choice for $f$ satisfies the conditions of (3). Indeed, that $f(0) = f'(0) = 0$ comes from the asymptotic formula $\psi'(x) \approx 1/x^2$ valid near $x = 0$, since
$$f'(0) = \lim_{x \to 0} \frac{-\psi''(x)}{\psi'(x)^2} = \lim_{x \to 0} \frac{2x^{-3}}{x^{-4}} = 0.$$
The fact that the reciprocal of the trigamma function $f(x) = 1/\psi'(x)$ is convex comes from an argument of Trimble-Wells-Wright [32], based on an inequality later proved in Alzer-Wells [1]. The fact that $f/f'$ is convex comes from Yang [34]. Another example of a function satisfying the conditions (3) is
$$f(x) = \frac{(2x + 1)x^2}{2x^2 + 2x + 1},$$
a simple rational function which approximates the reciprocal of the trigamma function well, in both the small-$x$ and large-$x$ regions.

Some useful consequences of our assumptions (3) are given in the following lemma. These results are well-known, but we include the simple proofs for completeness.

Lemma 1.
If $f$ satisfies (3), then we have $f(x) > 0$ and $f'(x) > 0$ for all $x > 0$. In addition $f$ and $f/f'$ are superadditive:
$$f(x_1 + \dots + x_n) > f(x_1) + \dots + f(x_n), \tag{4}$$
$$\frac{f(x_1 + \dots + x_n)}{f'(x_1 + \dots + x_n)} > \frac{f(x_1)}{f'(x_1)} + \dots + \frac{f(x_n)}{f'(x_n)}, \tag{5}$$
for all $x_1, \dots, x_n > 0$.

Proof. That $f'' > 0$ implies $f' > 0$ and thus $f > 0$ for all $x > 0$ is obvious. It has been known since Petrovich [25] that a convex function $f$ with $f(0) = 0$ is superadditive: an easy argument in the differentiable case is that
$$f(x + y) - f(x) - f(y) = \int_0^x \int_0^y f''(s + t)\, dt\, ds \geq 0.$$
By induction the general case (4) follows. Since $\lim_{x \to 0} f(x)/f'(x) = 0$, the same argument applies to $f/f'$ to give (5).

3.2. Lorentzian submanifold geometry. We now show that after a change of coordinates, $M$ can be seen as a codimension 1 submanifold of the $(n+1)$-dimensional flat Minkowski space $L^{n+1} = (\mathbb{R}^{n+1}, ds_L^2)$, where
$$ds_L^2 = dy_1^2 + \dots + dy_n^2 - dy_{n+1}^2. \tag{6}$$
In the sequel, we will denote by $\langle \cdot, \cdot \rangle$ the scalar product induced by this metric.

Proposition 2. The mapping
$$\Phi : M \to L^{n+1}, \quad (x_1, \dots, x_n) \mapsto (\eta(x_1), \dots, \eta(x_n), \eta(x_1 + \dots + x_n)),$$
where $\eta : \mathbb{R}_+ \to \mathbb{R}$ is defined by
$$\eta(x) = \int_1^x \frac{dr}{\sqrt{f(r)}},$$
is an isometric embedding.

Proof. Since $f(x) \approx \frac{1}{2} f''(0) x^2$ for $x \approx 0$, we see that
$$\int_0^1 \frac{dr}{\sqrt{f(r)}} = \infty,$$
so that the image of $\eta$ must include all negative reals. Therefore $\eta$ maps $\mathbb{R}_+$ bijectively to $(-\infty, N)$ for some $N \in (0, \infty]$. The behavior of $f$ at infinity assumed in (3) implies that $f(x) \leq Cx^2$ for all $x \geq K$, for some $K > 0$, $C > 0$, which in turn leads to
$$\int_K^\infty \frac{dx}{\sqrt{f(x)}} = \infty.$$
Therefore $\eta$ maps $\mathbb{R}_+$ bijectively to $\mathbb{R}$, and $\Phi$ is a homeomorphism onto its image. Since $\eta'(x) > 0$ for all $x$, it is also an immersion. Finally, if $(y_1, \dots, y_{n+1}) = \Phi(x_1, \dots, x_n)$,
$$dy_i^2 = \eta'(x_i)^2\, dx_i^2 = \frac{dx_i^2}{f(x_i)}, \quad i = 1, \dots, n, \qquad dy_{n+1}^2 = \frac{(dx_1 + \dots + dx_n)^2}{f(x_1 + \dots + x_n)},$$
and $\Phi$ is isometric.

Proposition 3.
$S = \Phi(M)$ is a codimension 1 submanifold of $L^{n+1}$ given by the graph of
$$y_{n+1} = \eta(\xi(y_1) + \dots + \xi(y_n)), \quad y_i > 0, \tag{7}$$
where $\xi = \eta^{-1}$. On this submanifold the metric is positive-definite and thus Riemannian. A basis of tangent vectors of $T_y S$ is defined by
$$e_i = \frac{\partial}{\partial y_i} + \frac{\sqrt{f(\xi(y_i))}}{\sqrt{f(\xi(y_{n+1}))}} \frac{\partial}{\partial y_{n+1}}, \quad i = 1, \dots, n. \tag{8}$$

Proof. Let $\gamma(u) = (y_1(u), \dots, y_{n+1}(u))$ be a parametrized curve in $S$. Then its coordinates verify the following relations
$$y_{n+1} = \eta(\xi(y_1) + \dots + \xi(y_n)),$$
$$y_{n+1}' = \eta'(\xi(y_1) + \dots + \xi(y_n)) \left( \xi'(y_1) y_1' + \dots + \xi'(y_n) y_n' \right),$$
and so, since $\xi'(x) = \sqrt{f(\xi(x))}$,
$$\gamma'(u) = \sum_{i=1}^n y_i'(u) \frac{\partial}{\partial y_i} + \eta'(\xi(y_{n+1}(u))) \left( \xi'(y_1(u)) y_1'(u) + \dots + \xi'(y_n(u)) y_n'(u) \right) \frac{\partial}{\partial y_{n+1}} = \sum_{i=1}^n y_i'(u) \left( \frac{\partial}{\partial y_i} + \frac{\sqrt{f(\xi(y_i(u)))}}{\sqrt{f(\xi(y_{n+1}(u)))}} \frac{\partial}{\partial y_{n+1}} \right),$$
yielding (8) as basis tangent vectors. The metric components on $S$ take the form
$$g_{ij} = \langle e_i, e_j \rangle = \delta_{ij} - W_i W_j, \quad \text{or} \quad g = I - WW^T,$$
where $\langle \cdot, \cdot \rangle$ denotes the flat Minkowskian metric (6) and $W_i = \sqrt{f(\xi(y_i))}/\sqrt{f(\xi(y_{n+1}))}$ for $i = 1, \dots, n$. Applying Lemma 16 of the appendix gives the result upon computing
$$W^T W = \frac{\sum_{i=1}^n f(\xi(y_i))}{f\left(\sum_{i=1}^n \xi(y_i)\right)} < 1,$$
by superadditivity of $f$, as in (4).

In other words, the metric (2) is the restriction of the flat Lorentzian metric
$$ds^2 = \frac{dx_1^2}{f(x_1)} + \dots + \frac{dx_n^2}{f(x_n)} - \frac{dt^2}{f(t)}$$
to the hyperplane $t = x_1 + \dots + x_n$. In the sequel, we will use this Lorentzian submanifold geometry to study the sectional curvature and geodesic completeness of $M$. We state the results in the original coordinate system of $M$ when possible, using the following notations for any $y = (y_1, \dots, y_{n+1}) \in S$:
$$x_i = \xi(y_i), \quad i = 1, \dots, n, \qquad t = x_1 + \dots + x_n = \xi(y_{n+1}). \tag{9}$$

3.3. Negative sectional curvature. The goal of this section is to prove that the sectional curvature of $M$ is everywhere negative. We start by computing the shape operator.

Proposition 4.
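The shape operator $\Sigma$ of $S = \Phi(M)$ has the following components in the basis (8) of tangent vectors
$$\langle \Sigma(e_i), e_j \rangle = \frac{1}{2\sqrt{f(t) - \sum_{\ell=1}^n f(x_\ell)}} \left( f'(x_i)\,\delta_{ij} - \frac{f'(t)}{f(t)} \sqrt{f(x_i) f(x_j)} \right).$$

Proof. We first observe that the basis vectors (8) can be expressed in coordinates (9) as
$$e_i = \frac{\partial}{\partial y_i} + \frac{\sqrt{f(x_i)}}{\sqrt{f(t)}} \frac{\partial}{\partial y_{n+1}}, \quad i = 1, \dots, n. \tag{10}$$
Since $S$ can be obtained as the graph of $F(y_1, \dots, y_n) = \eta(\xi(y_1) + \dots + \xi(y_n))$, a normal vector field to $S$ at $y$ is given by
$$N = \sum_{i=1}^n \frac{\partial F}{\partial y_i} \frac{\partial}{\partial y_i} + \frac{\partial}{\partial y_{n+1}} = \sum_{i=1}^n \frac{\sqrt{f(x_i)}}{\sqrt{f(t)}} \frac{\partial}{\partial y_i} + \frac{\partial}{\partial y_{n+1}}, \tag{11}$$
which yields a timelike vector since
$$\langle N, N \rangle = \frac{1}{f(t)} \left( f(x_1) + \dots + f(x_n) - f(t) \right) < 0, \tag{12}$$
by superadditivity of $f$. Since $\langle N, e_j \rangle = 0$, the shape operator is then given by
$$\langle \Sigma(e_i), e_j \rangle = \left\langle \bar\nabla_{e_i} \frac{N}{\sqrt{-\langle N, N \rangle}}, e_j \right\rangle = \frac{\langle \bar\nabla_{e_i} N, e_j \rangle}{\sqrt{-\langle N, N \rangle}}, \tag{13}$$
where $\bar\nabla$ is the flat connection of the Minkowski space. Denoting $\partial_i = \partial/\partial y_i$, we get from (10), (11) and the flatness of $\bar\nabla$,
$$\bar\nabla_{e_i} N = \bar\nabla_{\partial_i} N + \frac{\sqrt{f(x_i)}}{\sqrt{f(t)}} \bar\nabla_{\partial_{n+1}} N = \sum_{j=1}^n \partial_i \partial_j F\, \partial_j.$$
Inserting this last equation along with (12) into (13) yields
$$\langle \Sigma(e_i), e_j \rangle = \frac{\sqrt{f(t)}}{\sqrt{f(t) - \sum_{\ell=1}^n f(x_\ell)}}\, \partial_i \partial_j F.$$
Straightforward computations give
$$\partial_i F = \eta'(t)\, \xi'(y_i) = \sqrt{f(x_i)}/\sqrt{f(t)},$$
$$\partial_i \partial_j F = \eta''(t)\, \xi'(y_i)\, \xi'(y_j) + \eta'(t)\, \xi''(y_i)\, \delta_{ij} = \frac{1}{2\sqrt{f(t)}} \left( -\frac{f'(t)}{f(t)} \sqrt{f(x_i) f(x_j)} + f'(x_i)\, \delta_{ij} \right),$$
and the result follows after simplification.

Corollary 5. The second fundamental form given by Proposition 4 is positive-definite.

Proof. This follows from Lemma 16 and the decomposition of the matrix $\Sigma$ with components $\Sigma_{ij} = \langle \Sigma(e_i), e_j \rangle$ as
$$\Sigma = k(D - cVV^T),$$
where $D = \mathrm{diag}(d_1, \dots, d_n)$ is a diagonal matrix, $V = (v_i)_{1 \leq i \leq n}$ is a column vector and $c$ and $k$ are constants, defined for $i = 1, \dots, n$ by
$$d_i = f'(x_i), \quad v_i = \sqrt{f(x_i)}, \quad k = \frac{1}{2\sqrt{f(t) - \sum_{\ell=1}^n f(x_\ell)}}, \quad c = \frac{f'(t)}{f(t)}. \tag{14}$$
Recalling that $f > 0$ and $f' > 0$ by Lemma 1, we see that the matrix $D$ and constant $c$ are positive.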
There remains to verify that
$$cV^T D^{-1} V = \frac{f'(t)}{f(t)} \sum_{i=1}^n \frac{f(x_i)}{f'(x_i)} < 1,$$
which holds by the superadditivity property (5).

We can now show our main result.

Theorem 6. The sectional curvature of the Riemannian metric (2) is negative on $M$.

Proof. We use a result from O'Neill [22, Chapter 4, Corollary 20], which states that if the normal vector field $N$ of a hypersurface $M$ in a flat Lorentzian manifold $L$ is timelike, then the sectional curvature of the submanifold is given by
$$K(U, V) = -\frac{\langle \Sigma(U), U \rangle \langle \Sigma(V), V \rangle - \langle \Sigma(U), V \rangle^2}{\langle U, U \rangle \langle V, V \rangle - \langle U, V \rangle^2}, \tag{15}$$
where $U$ and $V$ are tangent to the submanifold and $\Sigma$ is the shape operator. The result now follows by the Cauchy-Schwarz inequality: since $\Sigma$ is a positive-definite symmetric matrix, we know that $\langle \Sigma(U), U \rangle \langle \Sigma(V), V \rangle \geq \langle \Sigma(U), V \rangle^2$ with equality if and only if $V$ is a multiple of $U$, but in that case the denominator vanishes as well. So the sectional curvature must be strictly negative.

We now give more specifically the formula of the sectional curvature of the planes generated by the basis tangent vectors (8).

Proposition 7. The sectional curvature along the axes defined by (8) is given by
$$K(e_i, e_j) = \frac{f(x_i) f'(x_j) f'(t) + f'(x_i) f(x_j) f'(t) - f'(x_i) f'(x_j) f(t)}{4 \left( f(t) - \sum_{\ell=1}^n f(x_\ell) \right) \left( f(t) - f(x_i) - f(x_j) \right)}.$$

Proof. This follows from applying formula (15) for the sectional curvature of a hypersurface in a flat Lorentzian manifold, with
$$\langle e_i, e_i \rangle = 1 - \frac{f(x_i)}{f(t)}, \qquad \langle e_i, e_j \rangle = -\frac{\sqrt{f(x_i) f(x_j)}}{f(t)}, \quad i \neq j,$$
and $\langle \Sigma(e_i), e_j \rangle$ given by Proposition 4.

Finally, we state a result about the eigenvalues of the shape operator, which will be useful to show geodesic completeness in the next section.

Proposition 8. The principal curvatures at any point in $S = \Phi(M)$ are bounded.

Proof. The principal curvatures at a given point $(y_1, \dots, y_{n+1}) = \Phi(x_1, \dots, x_n) \in S$ are the eigenvalues of the shape operator
$$\Sigma = k(D - cVV^T),$$
where $D$, $V$, $c$ and $k$ are defined by (14).
Without loss of generality, we assume that the $n$-tuple $(x_1, \dots, x_n)$ is ordered. Let us first show that when at least $n - 1$ variables go to zero, i.e. $x_i \to 0$ for $i = 1, \dots, n-1$ with the previous assumption, the principal curvatures go to zero. Let $\varepsilon = x_1 + \dots + x_{n-1}$, then $t = x_n + \varepsilon$ and
$$f(t) - f(x_1) - \dots - f(x_n) \underset{\varepsilon \to 0}{\sim} \varepsilon f'(x_n),$$
since $f$ has limit zero at zero, and so
$$k \underset{\varepsilon \to 0}{\sim} \frac{1}{2\sqrt{\varepsilon f'(x_n)}}.$$
Using the fact that when $\varepsilon \to 0$, recalling assumptions (3),
$$f(x_i) = O(x_i^2), \quad f'(x_i) = O(x_i), \quad x_i = O(\varepsilon), \quad i = 1, \dots, n-1,$$
we see that the diagonal terms of $D - cVV^T$ behave as
$$f'(x_i) - \frac{f'(t)}{f(t)} f(x_i) = O(\varepsilon), \quad i = 1, \dots, n-1,$$
$$f'(x_n) - \frac{f'(t)}{f(t)} f(x_n) = \frac{f'(x_n) f(x_n + \varepsilon) - f'(x_n + \varepsilon) f(x_n)}{f(x_n + \varepsilon)} \underset{\varepsilon \to 0}{\sim} \varepsilon \left( -f''(x_n) + \frac{f'(x_n)^2}{f(x_n)} \right),$$
while the off-diagonal terms verify
$$\frac{f'(t)}{f(t)} \sqrt{f(x_i) f(x_j)} = O(\varepsilon^2), \quad 1 \leq i, j \leq n-1,$$
$$\frac{f'(t)}{f(t)} \sqrt{f(x_i) f(x_n)} = O(\varepsilon), \quad 1 \leq i \leq n-1.$$
Finally, we obtain that
$$\Sigma_{ij} = \langle \Sigma(e_i), e_j \rangle = O_{\varepsilon \to 0}(\sqrt{\varepsilon}), \quad 1 \leq i, j \leq n,$$
and so the principal curvatures go to zero when $\varepsilon \to 0$. Therefore there exists $\delta > 0$ such that, at any point $(x_1, \dots, x_n)$ belonging to the set
$$D_\delta = \{ (x_1, \dots, x_n) \in (\mathbb{R}_+^*)^n,\ x_i < \delta \text{ for at least } n-1 \text{ indices } i \in \{1, \dots, n\} \},$$
the principal curvatures are upper bounded by, say, 1. Now let us consider an $n$-tuple $(x_1, \dots, x_n) \notin D_\delta$, ordered as before. Then the diagonal elements of $D$ are ordered as well since $f'$ is increasing, and the ordered eigenvalues of $k(D - cVV^T)$ verify
$$0 \leq \lambda_1 \leq \dots \leq \lambda_n \leq k d_n,$$
where the lower bound comes from the positive-definiteness shown in Corollary 5, and the upper bound comes from [15]. Since $d_n = f'(x_n)$ and $f'$ is increasing and upper bounded by $\lim_{x \to \infty} f'(x) = 1$, we have that $d_n \leq 1$. Since the function $(x_1, \dots, x_n) \mapsto$
$f(x_1 + \dots + x_n) - f(x_1) - \dots - f(x_n)$ is increasing in all of its variables, it is larger than its limit as the first $n - 2$ variables go to zero, and since at least $x_{n-1} > \delta$ and $x_n > \delta$, we obtain
$$k = \frac{1}{2\sqrt{f(t) - f(x_1) - \dots - f(x_n)}} \leq \frac{1}{2\sqrt{f(2\delta) - 2f(\delta)}},$$
and the principal curvatures are again bounded.

3.4. Geodesics and geodesic completeness. The geodesics of $M$ for the metric (2) are parametrized curves $u \mapsto (x_1(u), \dots, x_n(u))$ that solve the standard second-order ODEs
$$\ddot{x}_k + \sum_{1 \leq i, j \leq n} \Gamma_{ij}^k\, \dot{x}_i \dot{x}_j = 0, \quad k = 1, \dots, n,$$
whose coefficients can be computed using the following result.

Proposition 9. The Christoffel symbols for metric (2) are given by
$$\Gamma_{ij}^k = \frac{1}{2} \left[ \frac{f(x_k)}{f(t) - \sum_{\ell=1}^n f(x_\ell)} \left( g(t) - g(x_j)\, \delta_{ij} \right) - g(x_j)\, \delta_{ij} \delta_{jk} \right],$$
where $t = x_1 + \dots + x_n$ and $g(x) = f'(x)/f(x)$, while $\delta$ denotes the Kronecker delta function.

Proof. The Christoffel symbols of the second kind $\Gamma_{ij}^k$ can be obtained from the Christoffel symbols of the first kind $\Gamma_{ijk}$ and the coefficients $g^{ij}$ of the inverse of the metric matrix using the formula
$$\Gamma_{ij}^k = \Gamma_{ijl}\, g^{kl},$$
where we have used the Einstein summation convention. It is easy to see that the Christoffel symbols of the first kind are given by
$$\Gamma_{ijk} = \frac{1}{2} \left( \frac{f'(t)}{f(t)^2} - \frac{f'(x_i)}{f(x_i)^2}\, \delta_{ik} \delta_{jk} \right).$$
Applying the Sherman-Morrison formula, we obtain that the inverse of the metric matrix
$$g(x_1, \dots, x_n) = \mathrm{diag}\left( \frac{1}{f(x_1)}, \dots, \frac{1}{f(x_n)} \right) - \frac{1}{f(t)} J,$$
where $J$ denotes the $n$-by-$n$ matrix with all entries equal to one, is given by
$$g(x_1, \dots, x_n)^{-1} = \mathrm{diag}(f(x_1), \dots, f(x_n)) + \frac{1}{f(t) - \sum_{\ell=1}^n f(x_\ell)} \left[ f(x_i) f(x_j) \right]_{1 \leq i, j \leq n}. \tag{16}$$
Noticing that the sum of all the elements of the $k$th row (or column) of the inverse of the metric matrix is given by
$$\sum_{\ell=1}^n g^{k\ell} = f(x_k) + \frac{f(x_k) \sum_{\ell=1}^n f(x_\ell)}{f(t) - \sum_{\ell=1}^n f(x_\ell)} = \frac{f(x_k) f(t)}{f(t) - \sum_{\ell=1}^n f(x_\ell)}, \tag{17}$$
we obtain
$$\Gamma_{ij}^k = \sum_{\ell=1}^n g^{k\ell}\, \frac{1}{2} \left( \frac{f'(t)}{f(t)^2} - \frac{f'(x_\ell)}{f(x_\ell)^2}\, \delta_{i\ell} \delta_{j\ell} \right) = \frac{1}{2} \frac{f'(t)}{f(t)^2} \sum_{\ell=1}^n g^{k\ell} - \frac{1}{2} \frac{f'(x_j)}{f(x_j)^2}\, g^{kj} \delta_{ij}.$$
Inserting (17) and the general term of the inverse matrix (16) in the above yields
$$\Gamma_{ij}^k = \frac{1}{2} \frac{f'(t)}{f(t)^2} \frac{f(x_k) f(t)}{f(t) - \sum_{\ell=1}^n f(x_\ell)} - \frac{1}{2} \frac{f'(x_j)}{f(x_j)^2} \left( f(x_k)\, \delta_{kj} + \frac{f(x_k) f(x_j)}{f(t) - \sum_{\ell=1}^n f(x_\ell)} \right) \delta_{ij}$$
and the result follows.

Now, using the result of Proposition 8 and a theorem from [17], we can show that $M$ is geodesically complete.

Theorem 10. $M$ equipped with the Riemannian metric (2) is geodesically complete.

Proof. The image of $M$ by $\Phi$ is a hypersurface of the $(n+1)$-Minkowski space $L^{n+1}$. Moreover, $\Phi$ is an embedding and it is closed since $\Phi(M)$ is a closed subset of $L^{n+1}$ as preimage of the singleton $\{0\}$ by the continuous map $(y_1, \dots, y_{n+1}) \mapsto \xi(y_1) + \dots + \xi(y_n) - \xi(y_{n+1})$. Therefore $\Phi$ is proper [17, Theorem 1]. Then, [17, Theorem 6] allows us to conclude that since $\Phi$ has bounded principal curvatures by Proposition 8, $M$ equipped with the pullback (2) of the Minkowski metric by $\Phi$ is complete.

3.5. Uniqueness of the Fréchet mean. Since $M$ is simply connected, we deduce from Theorems 6 and 10 the following.

Corollary 11. $M$ equipped with the Riemannian metric (2) is a Hadamard manifold.

This has important implications in information geometry, as it guarantees the uniqueness of the Fréchet mean of a set of points in this geometry. The Fréchet mean, also called intrinsic mean, is a popular choice to extend the notion of barycenter to a Riemannian manifold. It is defined for a set of points $p_1, \dots, p_N \in M$ as the minimizer of the sum of the squared geodesic distances to the points of the set
$$\bar{p} = \operatorname{argmin}_{p \in M} \sum_{i=1}^N d(p, p_i)^2.$$
It exists as long as $M$ is complete; however, it is in general not unique and refers to a set. Uniqueness holds however for Hadamard manifolds [18].
This implies that the notion of barycenter of Dirichlet distributions is well defined in the Fisher-Rao geometry.

4. The two-dimensional case of beta distributions

The simplest case is obviously when $n = 2$, and even in this case the formulas are nontrivial. When $f = 1/\psi'$, the metric comes from the well-known two-parameter family of beta distributions defined on the compact interval $[0, 1]$, which is important in statistics and useful in many applications.

Proposition 12. The geodesic equations are given by
$$\begin{aligned} a(x, y)\,\ddot{x} + b(x, y)\,\dot{x}^2 + c(x, y)\,\dot{x}\dot{y} + d(x, y)\,\dot{y}^2 &= 0, \\ a(y, x)\,\ddot{y} + b(y, x)\,\dot{y}^2 + c(y, x)\,\dot{x}\dot{y} + d(y, x)\,\dot{x}^2 &= 0, \end{aligned} \tag{18}$$
where
$$\begin{aligned} a(x, y) &= 2\left( f(x + y) - f(x) - f(y) \right), \\ b(x, y) &= f(y) g(x) + f(x) g(x + y) - f(x + y) g(x), \\ c(x, y) &= 2 f(x) g(x + y), \\ d(x, y) &= f(x) g(x + y) - g(y) f(x), \end{aligned}$$
with the shorthand $g(x) = f'(x)/f(x)$.

Figure 2. On the left, geodesic balls and on the right, sectional curvature of the manifold of beta distributions ($n = 2$).

Figure 3. On the left, geodesic between the beta distributions of parameters $(2, 5)$ and $(2, 2)$ and on the right, Fréchet mean (full red line) compared to the Euclidean mean (dashed red line) of the beta distributions of parameters $(2, 5)$, $(2, 2)$ and $(5, 1)$, shown in terms of probability density function.

Proof. The geodesic equations can be expressed in terms of the Christoffel symbols as
$$\begin{aligned} \ddot{x} + \Gamma_{11}^1 \dot{x}^2 + 2\Gamma_{12}^1 \dot{x}\dot{y} + \Gamma_{22}^1 \dot{y}^2 &= 0, \\ \ddot{y} + \Gamma_{22}^2 \dot{y}^2 + 2\Gamma_{12}^2 \dot{x}\dot{y} + \Gamma_{11}^2 \dot{x}^2 &= 0, \end{aligned}$$
and the coefficients can be computed using Proposition 9.

No closed form is known for the geodesics, but they can be computed numerically by solving (18), see the left-hand side of Figure 2. Nonetheless we can notice that, due to the symmetry of the metric with respect to parameters $x$ and $y$, both equations in (18) yield a unique ordinary differential equation when $x = y$.

Corollary 13.
Solutions of the geodesic equation (18) with $x(0) = y(0)$ and $\dot{x}(0) = \dot{y}(0)$ satisfy
$$q(x(t))\, \dot{x}(t)^2 = \text{constant}, \tag{19}$$
where $q(x) = \frac{1}{f(x)} - \frac{2}{f(2x)}$, and thus can be found by quadratures.

Proof. If at some time $t_0$ we have $x = y$ and $\dot{x} = \dot{y}$, then the equations (18) imply that $\ddot{x} = \ddot{y}$ at $t_0$. Differentiating repeatedly in time shows that all higher derivatives must also be equal at $t_0$, and we conclude by analyticity of the solutions that $x(t) = y(t)$ on some interval. The usual extension arguments for ODEs then imply that $x(t) = y(t)$ on the entire domain of the solution, which by Theorem 10 is $\mathbb{R}$. When $x = y$ equation (18) reduces to
$$2\left( f(2x) - 2f(x) \right) \ddot{x} + \left( \frac{4 f(x) f'(2x)}{f(2x)} - \frac{f(2x) f'(x)}{f(x)} \right) \dot{x}^2 = 0,$$
which is equivalent to
$$2 q(x)\, \ddot{x} + q'(x)\, \dot{x}^2 = 0.$$
This clearly implies the conservation law (19).

The differential equation (19) can then be solved by writing
$$t = \frac{1}{\dot{x}_0 \sqrt{q(x_0)}} \int_{x_0}^{x} \sqrt{q(s)}\, ds$$
and inverting the resulting function. For example, if $f(x) = 1/\psi'(x)$, then the duplication formula for the trigamma function implies
$$q(x) = \psi'(x) - 2\psi'(2x) = \tfrac{1}{2} \left[ \psi'(x) - \psi'(x + \tfrac{1}{2}) \right].$$
Asymptotically this looks like $q(x) \approx \frac{1}{2x^2}$ for $x \approx 0$ and $q(x) \approx \frac{1}{4x^2}$ as $x \to \infty$. We conclude that it takes infinite time for a geodesic along the diagonal to either reach "diagonal infinity" or the origin, as Theorem 10 of course implies.

From an applications point of view, the geodesics for the Fisher-Rao geometry allow us to define a notion of optimal interpolation between beta and more generally Dirichlet distributions. An example of such an optimal interpolation is shown on the left-hand side of Figure 3, in terms of probability density function.

Now we give the formula for the sectional curvature in two dimensions.

Proposition 14. If $n = 2$, the sectional curvature is given by
$$K(x, y) = -\frac{1}{4}\, \frac{f(t) f'(x) f'(y) - f(x) f'(t) f'(y) - f(y) f'(t) f'(x)}{\left( f(t) - f(x) - f(y) \right)^2}, \tag{20}$$
where $t = x + y$.

Proof. This is just a particular case of Proposition 7.
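
Formula (20) is straightforward to evaluate once $f$ and $f'$ are available. The sketch below is an illustration only, with hypothetical helper names: instead of $f = 1/\psi'$ it uses the rational function $f(x) = (2x+1)x^2/(2x^2+2x+1)$ given in Section 3.1 as another function satisfying (3), because its derivative has a simple closed form (computed here by the quotient rule).

```python
def f_rational(x):
    """Rational example f(x) = (2x + 1) x^2 / (2x^2 + 2x + 1) from Section 3.1."""
    return (2.0 * x + 1.0) * x * x / (2.0 * x * x + 2.0 * x + 1.0)


def f_rational_prime(x):
    """Derivative of f_rational, derived by hand with the quotient rule."""
    numerator = 4.0 * x ** 4 + 8.0 * x ** 3 + 8.0 * x ** 2 + 2.0 * x
    denominator = (2.0 * x * x + 2.0 * x + 1.0) ** 2
    return numerator / denominator


def sectional_curvature(x, y, func=f_rational, dfunc=f_rational_prime):
    """Evaluate K(x, y) of equation (20) for a given f (func) and f' (dfunc)."""
    t = x + y
    numerator = (func(t) * dfunc(x) * dfunc(y)
                 - func(x) * dfunc(t) * dfunc(y)
                 - func(y) * dfunc(t) * dfunc(x))
    gap = func(t) - func(x) - func(y)  # positive by superadditivity (4)
    return -numerator / (4.0 * gap * gap)


k_diag = sectional_curvature(1.0, 1.0)  # equals -4/11 for this rational f
```

To work with the actual Fisher-Rao metric, one would pass `func` and `dfunc` implementing $1/\psi'$ and $-\psi''/\psi'^2$ instead; with SciPy, `scipy.special.polygamma(1, x)` and `scipy.special.polygamma(2, x)` provide $\psi'$ and $\psi''$.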

Notice that in two dimensions, the negativity of the sectional curvature is straightforward, as there is only one Gaussian curvature to consider, which is given by (20), in which one can easily see that the numerator is positive by factorizing by $f'(x) f'(y) f'(t) > 0$ and using the superadditivity property (5) of $f/f'$.

As previously mentioned, the negative curvature of the Fisher-Rao geometry also has interesting implications for applications: it entails that the Fréchet mean of a set of beta, or more generally Dirichlet distributions is well defined. An example of Fréchet mean of beta distributions is shown in terms of probability density function on the right-hand side of Figure 3.

Figure 4. The difference (23) between the sectional curvatures of the plane generated by $e_1$ and $e_2$ in two and three dimensions changes sign for $z = 0.01$.

Numerically we observe that when $f = 1/\psi'$, the function $K(x, y)$ given by (20) is decreasing in both the $x$ and $y$ variables (see the right-hand side of Figure 2), but we do not yet have a proof of this fact. However we may analyze the asymptotics of the function relatively easily.

Proposition 15. If $f = 1/\psi'$, then the asymptotic behavior of the sectional curvature given by (20) approaching the boundary square is given by
$$\lim_{y \to 0} K(x, y) = \lim_{y \to 0} K(y, x) = \frac{3}{4} - \frac{\psi'(x)\, \psi'''(x)}{2\, \psi''(x)^2}, \tag{21}$$
$$\lim_{y \to \infty} K(x, y) = \lim_{y \to \infty} K(y, x) = \frac{x \psi''(x) + \psi'(x)}{4 \left( x \psi'(x) - 1 \right)^2}. \tag{22}$$
Moreover, we have the following limits at the asymptotic corners:
$$\lim_{x, y \to 0} K(x, y) = 0, \qquad \lim_{x, y \to \infty} K(x, y) = -\frac{1}{2}, \qquad \lim_{x \to 0,\, y \to \infty} K(x, y) = \lim_{x \to \infty,\, y \to 0} K(x, y) = -\frac{1}{4}.$$

Proof. Writing $K(x, y) = -\frac{A(x, y)}{4 B(x, y)^2}$, with
$$A(x, y) = f(x + y) f'(x) f'(y) - f(x) f'(x + y) f'(y) - f(y) f'(x + y) f'(x),$$
$$B(x, y) = f(x + y) - f(x) - f(y),$$
we note that $A(x, 0) = A_y(x, 0) = 0$ and $B(x, 0) = 0$, so that
$$\lim_{y \to 0} K(x, y) = -\frac{A_{yy}(x, 0)}{8 B_y(x, 0)^2},$$
which gives (21) after rewriting in terms of $\psi'$.
For the infinite limits, we use the facts that $\lim_{y \to \infty} \left( f(y) - y \right) = -\frac{1}{2}$ and $\lim_{y \to \infty} f'(y) = 1$, and that $\lim_{y \to \infty} y \left( f'(y) - 1 \right) = 0$, to obtain limits of $A(x, y)$ and $B(x, y)$ separately with elementary computations.

These limits and strong numerical evidence allow us to conjecture that the sectional curvature in two dimensions is lower bounded by $-1/2$. Comparing the two-dimensional sectional curvature $K(x, y) = K_2(x, y)$ with the sectional curvature of the plane generated by $e_1$ and $e_2$ in three dimensions, that we denote by $K_3(x, y, z)$, we observe numerically that for a given $z > 0$, the function
$$(x, y) \mapsto K_3(x, y, z) - K_2(x, y) \tag{23}$$
does not have a fixed sign in general, as can be observed on Figure 4 for small values of $x$, $y$ and $z$.

Acknowledgments

S. C. Preston was partially supported by Simons Foundation, Collaboration Grant for Mathematicians, no. 318969. A. Le Brigant and S. Puechmorel would like to thank Fabrice Gamboa and Thierry Klein for bringing this problem to their attention and for fruitful discussions.

Appendix

Here we give a well-known principle to establish positivity of matrices.

Lemma 16. Suppose $A$ is a positive-definite symmetric matrix, $V$ is a vector, and $c$ is a positive real number. Then $B = A - cVV^T$ is positive-definite if and only if
$$cV^T A^{-1} V < 1.$$

Proof. Since $A$ is positive-definite and symmetric, we may write $A = P^2$ for some positive-definite symmetric matrix $P$. Let $X = P^{-1}V$; then we may write
$$B = P^2 - cVV^T = P\left( I - c(P^{-1}V)(P^{-1}V)^T \right) P = P(I - cXX^T)P.$$
Denoting by $\langle U | U \rangle = U^T U$ the usual scalar product on $\mathbb{R}^n$, we have for any vector $U$,
$$\langle U | BU \rangle = \langle PU | PU \rangle - c \langle PU | X \rangle^2 = |Y|^2 - c \langle Y | X \rangle^2 \geq |Y|^2 - c|X|^2 |Y|^2 = |Y|^2 (1 - c|X|^2),$$
where $Y = PU$, using the Cauchy-Schwarz inequality. This is positive for all $U$ if and only if the right side is positive for all $Y$, which translates into $c|X|^2 < 1$. Since $|X|^2 = \langle P^{-1}V | P^{-1}V \rangle = \langle V | A^{-1}V \rangle$, we obtain the claimed result.
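
The criterion of Lemma 16 can be checked numerically on small examples. The following sketch is an illustration under simplifying assumptions, not part of the original argument: it restricts to a diagonal matrix $A$ (the case used in the paper, where $A$ is the identity or the matrix $D$), all function names are hypothetical, and positive-definiteness of $B$ is tested directly by attempting a Cholesky factorization.

```python
import math


def rank_one_update(a_diag, v, c):
    """Build B = A - c V V^T for a diagonal A given by its diagonal entries."""
    n = len(v)
    return [[(a_diag[i] if i == j else 0.0) - c * v[i] * v[j] for j in range(n)]
            for i in range(n)]


def lemma16_quantity(a_diag, v, c):
    """Return c V^T A^{-1} V, the quantity that Lemma 16 compares with 1."""
    return c * sum(vi * vi / ai for vi, ai in zip(v, a_diag))


def is_positive_definite(b):
    """Direct test: a symmetric matrix is positive-definite iff Cholesky succeeds."""
    n = len(b)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = b[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # a non-positive pivot rules out positive-definiteness
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


# Two cases on either side of the threshold c V^T A^{-1} V = 1.
a_diag, v = [2.0, 3.0, 5.0], [1.0, 1.0, 1.0]
below = lemma16_quantity(a_diag, v, 0.9) < 1.0   # criterion satisfied
above = lemma16_quantity(a_diag, v, 1.0) < 1.0   # criterion violated
```

In both cases the direct test `is_positive_definite(rank_one_update(a_diag, v, c))` agrees with the criterion, which is cheaper to evaluate since it needs no factorization.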

References

[1] Horst Alzer and Jim Wells. Inequalities for the polygamma functions. SIAM Journal on Mathematical Analysis, 29(6):1459-1466, 1998.
[2] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276.
[3] Shun-ichi Amari. Information geometry and its applications, volume 194. Springer, 2016.
[4] Jesus Angulo and Santiago Velasco-Forero. Morphological processing of univariate Gaussian distribution-valued images based on Poincaré upper-half plane representation. In Geometric Theory of Information, pages 331-366. Springer, 2014.
[5] Marc Arnaudon, Frédéric Barbaresco, and Le Yang. Riemannian medians and means with applications to radar signal processing. IEEE Journal of Selected Topics in Signal Processing, 7(4):595-604, 2013.
[6] Khadiga Arwini and Christopher TJ Dodson. Information geometry: near randomness and near independence. Springer Science & Business Media, 2008.
[7] Colin Atkinson and Ann FS Mitchell. Rao's distance measure. Sankhyā: The Indian Journal of Statistics, Series A, pages 345-365, 1981.
[8] Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer. Information geometry and sufficient statistics. Probability Theory and Related Fields, 162(1-2):327-364, 2015.
[9] Martin Bauer, Martins Bruveris, and Peter W Michor. Uniqueness of the Fisher-Rao metric on the space of smooth densities. Bulletin of the London Mathematical Society, 48(3):499-506, 2016.
[10] Nizar Bouguila, Djemel Ziou, and Jean Vaillancourt. Unsupervised learning of a finite mixture model based on the Dirichlet distribution and its application. IEEE Transactions on Image Processing, 13(11):1533-1543, 2004.
[11] Andrew H Briggs, AE Ades, and Martin J Price. Probabilistic sensitivity analysis for decision trees with multiple branches: use of the Dirichlet distribution in a Bayesian framework. Medical Decision Making, 23(4):341-350, 2003.
[12] Ovidiu Calin and Constantin Udrişte.
Geometric modeling in probability and statistics. Springer, 2014. [13] Nikolai Nikolaevich Cencov. Statistical decision rules and optimal inference. transl. math. Monographs, American Mathematical Society, Providence, RI, 1982. [14] Ronald A Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222(594-604):309{368, 1922. [15] Gene H Golub. Some modi ed matrix eigenvalue problems. Siam Review, 15(2):318{334, 1973. [16] Tom Griths. Gibbs sampling in the generative model of latent dirichlet allocation. 2002. [17] Stephen G Harris. Closed and complete spacelike hypersurfaces in minkowski space. Classical and Quantum Gravity, 5(1):111, 1988. [18] H. Karcher. Riemannian center of mass and molli er smoothing. Communications on pure and applied mathematics, 30(5):509{541, 1977. [19] Stefan L Lauritzen. Statistical manifolds. Dierential geometry in statistical inference, 10:163{216, [20] Rasmus E Madsen, David Kauchak, and Charles Elkan. Modeling word burstiness using the dirichlet distribution. In Proceedings of the 22nd international conference on Machine learning, pages 545{552, [21] Yann Ollivier. True asymptotic natural gradient optimization. arXiv preprint arXiv:1712.08449, 2017. [22] Barrett O'neill. Semi-Riemannian geometry with applications to relativity. Academic press, 1983. [23] Philip D O'Neill and Gareth O Roberts. Bayesian inference for partially observed stochastic epidemics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 162(1):121{129, 1999. [24] Adrian Peter and Anand Rangarajan. Shape analysis using the sher-rao riemannian metric: Unifying shape representation and deformation. In 3rd IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2006., pages 1164{1167. IEEE, 2006. [25] M Petrovich. Sur une fonctionnelle. Publ. Math. Beograd, TL, 1932. [26] C Radhakrishna Rao. 
Information and the accuracy attainable in the estimation of statistical parame- ters. Bull. Calcutta Math. Soc., 37, 01 1945. [27] Sana Rebbah, Florence Nicol, and St ephane Puechmorel. The geometry of the generalized gamma manifold and an application to medical imaging. Mathematics, 7(8):674, 2019. [28] Salem Said, Lionel Bombrun, and Yannick Berthoumieu. Warped riemannian metrics for location-scale models. In Geometric Structures of Information, pages 251{296. Springer, 2019. [29] Olivier Schwander and Frank Nielsen. Model centroids for the simpli cation of kernel density estimators. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 737{740. IEEE, 2012. [30] Lene Theil Skovgaard. A Riemannian geometry of the multivariate normal model. Scandinavian Journal of Statistics, pages 211{223, 1984. [31] Anuj Srivastava, Ian Jermyn, and Shantanu Joshi. Riemannian analysis of probability density functions with applications in vision. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1{8. IEEE, 2007. [32] SY Trimble, Jim Wells, and FT Wright. Superadditive functions and a statistical application. SIAM journal on mathematical analysis, 20(5):1255{1259, 1989. [33] Shengping Yang and Zhide Fang. Beta approximation of ratio distribution and its application to next generation sequencing read counts. Journal of applied statistics, 44(1):57{70, 2017. [34] Zhen-Hang Yang. Some properties of the divided dierence of psi and polygamma functions. Journal of Mathematical Analysis and Applications, 455(1):761 { 777, 2017. [35] Zhenning Zhang, Huafei Sun, and Fengwei Zhong. Information geometry of the power inverse gaussian distribution. Applied Sciences, 9, 2007. 16 ALICE LE BRIGANT, STEPHEN C. PRESTON, AND STEPHANE PUECHMOREL SAMM 4543, Universite Paris 1 Pantheon Sorbonne, Centre PMF, Paris, France. 
Email address : alice.le-brigant@univ-paris1.fr Department of Mathematics, Brooklyn College and CUNY Graduate Center, New York, USA. Email address : stephen.preston@brooklyn.cuny.edu Ecole Nationale de l'Aviation Civile, Universite de Toulouse, Toulouse, France. Email address : stephane.puechmorel@enac.fr http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Mathematics arXiv (Cornell University) http://www.deepdyve.com/lp/arxiv-cornell-university/fisher-rao-geometry-of-dirichlet-distributions-zR2fRnOy2s


ISSN: 0926-2245
eISSN: ARCH-3343
DOI: 10.1016/j.difgeo.2020.101702

Abstract

Alice Le Brigant, Stephen C. Preston, and Stéphane Puechmorel

Abstract. In this paper, we study the geometry induced by the Fisher-Rao metric on the parameter space of Dirichlet distributions. We show that this space is a Hadamard manifold, i.e. that it is geodesically complete and has everywhere negative sectional curvature. An important consequence for applications is that the Fréchet mean of a set of Dirichlet distributions is uniquely defined in this geometry.

1. Introduction

The differential geometric approach to probability theory and statistics has met increasing interest in the past years, from the theoretical point of view as well as in applications. In this approach, probability distributions are seen as elements of a differentiable manifold, on which a metric structure is defined through the choice of a Riemannian metric. Two very important ones are the Wasserstein metric, central in optimal transport, and the Fisher-Rao metric (also called Fisher information metric), essential in information geometry. Unlike optimal transport, information geometry is foremost concerned with parametric families of probability distributions, and defines a Riemannian structure on the parameter space using the Fisher information matrix [14]. It was Rao who showed in 1945 [26] that the Fisher information could be used to locally define a scalar product on the space of parameters, and interpreted as a Riemannian metric. Later on, Čencov [13] proved that it was the only metric invariant with respect to sufficient statistics, for families with finite sample spaces. This result has been extended more recently to non parametric distributions with infinite support [8, 9]. Information geometry has been used to obtain new results in statistical inference as well as gain insight on existing ones.
In parameter estimation for example, Amari [3] shows that conditions for consistency and efficiency of estimators can be expressed in terms of geometric conditions; in the presence of hidden variables, the famous Expectation-Maximisation (EM) algorithm can be described in an entirely geometric manner; and in order to ensure invariance under diffeomorphic changes of parametrization, the so-called natural gradient [2] can be used to define accurate parameter estimation algorithms [21].

Another important use of information geometry is for the effective comparison and analysis of families of probability distributions. The geometric tools provided by the Riemannian framework, such as the geodesics, geodesic distance and intrinsic mean, have proved useful to interpolate, compare, average or perform segmentation between objects modeled by probability densities, in applications such as signal processing [5], image [29, 4] or shape analysis [24, 31], to name a few. These applications rely on the specific study of the geometries of usual parametric families of distributions, which started with the early work of Atkinson and Mitchell. In [7], the authors study the trivial geometries of one-parameter families of distributions, the hyperbolic geometry of the univariate normal model as well as special cases of the multivariate normal model, a work that is continued by Skovgaard in [30]. The family of gamma distributions has been studied by Lauritzen in [19], and more recently by Arwini and Dodson in [6], who also focus on the log-normal, log-gamma, and families of bivariate distributions. Power inverse Gaussian distributions [35], location-scale models and in particular the von Mises distribution [28], and the generalized gamma distributions [27] have also received attention.

arXiv:2005.05608v2 [math.DG] 19 Nov 2020
In this work, we are interested in Dirichlet distributions, a family of probability densities defined on the $(n-1)$-dimensional probability simplex, that is the set of vectors of $\mathbb{R}^n$ with non-negative components that sum up to one. The Dirichlet distribution models a random probability distribution on a finite set of size $n$. It generalizes the beta distribution, a two-parameter probability measure on $[0,1]$ used to model random variables defined on a compact interval. Beta and Dirichlet distributions are often used in Bayesian inference as conjugate priors for several discrete probability laws [23, 16, 11], but also come up in a wide variety of other applications, e.g. to model percentages and proportions in genomic studies [33], distribution of words in text documents [20], or for mixture models [10]. To our knowledge, the information geometry of Dirichlet distributions has not yet received much attention. In [12], the authors give the expression of the Fisher-Rao metric for the family of beta distributions, but nothing is said about the geodesics or the curvature.

In this paper, we give new results and properties for the geometry of Dirichlet distributions, and its sectional curvature in particular. The derived expressions depend on the trigamma function, the second derivative of the logarithm of the gamma function; however, we will avoid using its properties when possible to obtain our results. Instead, we consider a more general metric written using a function $f$, for which we only make the strictly necessary assumptions. Section 2 gives the setup for our problem by considering the Fisher-Rao metric on the space of parameters of Dirichlet distributions. In Section 3, we consider the more general metric where $f$ replaces the trigamma function, and show that it induces the geometry of a submanifold in a flat Lorentzian space. This allows us to show geodesic completeness, and that the sectional curvature is everywhere negative.
Section 4 focuses on the two-dimensional case, i.e. beta distributions.

2. Fisher-Rao metric on the manifold of Dirichlet distributions

Let $\Delta_n$ denote the $(n-1)$-dimensional probability simplex, i.e. the set of vectors in $\mathbb{R}^n$ with non-negative components that sum up to one
$$\Delta_n = \Big\{ q = (q_1, \dots, q_n) \in \mathbb{R}^n,\ \sum_{i=1}^n q_i = 1,\ q_i \ge 0,\ i = 1, \dots, n \Big\}.$$
The family of Dirichlet distributions is a family of probability distributions on $\Delta_n$ parametrized by $n$ positive scalars $x_1, \dots, x_n > 0$ (Figure 1), that admits the following probability density function with respect to the Lebesgue measure
$$f_n(q \mid x_1, \dots, x_n) = \frac{\Gamma(x_1 + \dots + x_n)}{\Gamma(x_1) \cdots \Gamma(x_n)}\, q_1^{x_1 - 1} \cdots q_n^{x_n - 1}.$$
As an open subset of $\mathbb{R}^n$, the space of parameters $M = (\mathbb{R}_+^*)^n$ is a differentiable manifold and can be equipped with a Riemannian metric defined in its matrix form by the Fisher information matrix
$$g_{ij}(x_1, \dots, x_n) = -E\left[ \frac{\partial^2}{\partial x_i \partial x_j} \log f_n(Q \mid x_1, \dots, x_n) \right], \quad i, j = 1, \dots, n,$$
where $E$ denotes the expectation taken with respect to $Q$, a random variable with density $f_n(\cdot \mid x_1, \dots, x_n)$.

Figure 1. Random samples drawn from Dirichlet distributions on the 2-dimensional simplex $\Delta_3$ for different values of the parameters $(x_1, x_2, x_3)$.

The Dirichlet distributions form an exponential family and so the Fisher-Rao metric is the hessian of the log-partition function [3], namely
$$g_{ij}(x_1, \dots, x_n) = \frac{\partial^2}{\partial x_i \partial x_j}\, \varphi(x_1, \dots, x_n), \quad i, j = 1, \dots, n,$$
where $\varphi$ is the logarithm of the normalizing factor
$$\varphi(x_1, \dots, x_n) = \sum_{i=1}^n \log \Gamma(x_i) - \log \Gamma(x_1 + \dots + x_n).$$
We obtain the following metric tensor.
$$g_{ij}(x_1, \dots, x_n) = \psi'(x_i)\,\delta_{ij} - \psi'(x_1 + \dots + x_n), \tag{1}$$
where $\delta_{ij}$ is the Kronecker delta function, and $\psi$ denotes the digamma function, that is the first derivative of the logarithm of the gamma function, i.e.
$$\psi(x) = \frac{d}{dx} \log \Gamma(x).$$
Its derivative $\psi'$ is called the trigamma function.
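To make the metric tensor concrete, here is a minimal Python sketch that evaluates the matrix $g_{ij} = \psi'(x_i)\,\delta_{ij} - \psi'(x_1 + \dots + x_n)$ at a given parameter. It is an illustration under stated assumptions rather than the authors' code: the function names (`trigamma`, `fisher_rao_metric`, `squared_length`) are hypothetical, and the trigamma function is approximated by hand (recurrence $\psi'(x) = \psi'(x+1) + 1/x^2$ followed by the standard asymptotic series) so that the example runs with the standard library only.

```python
import math


def trigamma(x):
    """Approximate the trigamma function psi'(x) for x > 0.

    Hand-rolled for this sketch: shift x upward with the recurrence
    psi'(x) = psi'(x + 1) + 1/x**2, then apply the asymptotic series.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    total = 0.0
    while x < 6.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi'(x) ~ 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    tail = inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))
    return total + inv + 0.5 * inv2 + tail


def fisher_rao_metric(x):
    """Return the matrix g_ij = psi'(x_i) * delta_ij - psi'(x_1 + ... + x_n)."""
    common = trigamma(sum(x))
    n = len(x)
    return [[(trigamma(x[i]) if i == j else 0.0) - common for j in range(n)]
            for i in range(n)]


def squared_length(x, v):
    """Squared Fisher-Rao norm g(v, v) of a tangent vector v at the parameter x."""
    g = fisher_rao_metric(x)
    n = len(x)
    return sum(g[i][j] * v[i] * v[j] for i in range(n) for j in range(n))


# Example parameter of a Dirichlet distribution on the 2-dimensional simplex.
point = [2.0, 3.0, 0.5]
for row in fisher_rao_metric(point):
    print(["%.6f" % entry for entry in row])
print("g(v, v) =", squared_length(point, [1.0, -1.0, 0.25]))
```

Since the metric is Riemannian, `squared_length` should return a positive number for any nonzero tangent vector. A library implementation of the polygamma functions could replace the hand-written `trigamma` if one is available.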
As noted below, the trigamma function is a function whose reciprocal is increasing, convex, and sublinear on $\mathbb{R}_+$. For slightly greater generality, and to emphasize what properties of this function are needed for our results, we will work in the sequel with a more general function $f$ on which we make only the necessary assumptions; in our special case we have $f = 1/\psi'$.

3. The general framework

3.1. The metric. In this section we consider a more general geometry, that admits the Fisher-Rao geometry of Dirichlet distributions as a special case. The goal is to avoid using the properties of the trigamma function when possible. For this, we consider the quadrant $M = (\mathbb{R}_+^*)^n$ equipped with a metric of the form
$$ds^2 = \frac{dx_1^2}{f(x_1)} + \dots + \frac{dx_n^2}{f(x_n)} - \frac{(dx_1 + \dots + dx_n)^2}{f(x_1 + \dots + x_n)}, \tag{2}$$
where $f : \mathbb{R}_+ \to \mathbb{R}_+$ is a function on which we make the following assumptions:
$$f(x) \underset{x\to 0}{=} O(x^2), \quad f'(x) \underset{x\to 0}{=} O(x), \quad f(x) \underset{x\to\infty}{=} O(x^2), \quad f'' > 0 \quad\text{and}\quad \frac{d^2}{dx^2}\left(\frac{f}{f'}\right) > 0. \tag{3}$$
We retrieve the Fisher-Rao metric (1) when
$$f(x) = \frac{1}{\psi'(x)}.$$
Notice that this choice for $f$ satisfies the conditions of (3). Indeed, that $f(0) = f'(0) = 0$ comes from the asymptotic formula $\psi'(x) \approx 1/x^2$ valid near $x = 0$, since
$$f'(0) = \lim_{x\to 0} \frac{-\psi''(x)}{\psi'(x)^2} = \lim_{x\to 0} \frac{2x^{-3}}{x^{-4}} = 0.$$
The fact that the reciprocal of the trigamma function $f(x) = 1/\psi'(x)$ is convex comes from an argument of Trimble-Wells-Wright [32], based on an inequality later proved in Alzer-Wells [1]. The fact that $f/f'$ is convex comes from Yang [34]. Another example of a function satisfying the conditions (3) is
$$f(x) = \frac{(2x+1)\,x^2}{2x^2 + 2x + 1},$$
a simple rational function which approximates the reciprocal of the trigamma function well, in both the small-$x$ and large-$x$ regions. Some useful consequences of our assumptions (3) are given in the following lemma. These results are well-known, but we include the simple proofs for completeness.

Lemma 1.
If $f$ satisfies (3), then we have $f(x) > 0$ and $f'(x) > 0$ for all $x > 0$. In addition $f$ and $f/f'$ are superadditive:
$$f(x_1 + \dots + x_n) > f(x_1) + \dots + f(x_n), \tag{4}$$
$$\frac{f(x_1 + \dots + x_n)}{f'(x_1 + \dots + x_n)} > \frac{f(x_1)}{f'(x_1)} + \dots + \frac{f(x_n)}{f'(x_n)}, \tag{5}$$
for all $x_1, \dots, x_n > 0$.

Proof. That $f'' > 0$ implies $f' > 0$ and thus $f > 0$ for all $x > 0$ is obvious. It has been known since Petrovich [25] that a convex function $f$ with $f(0) = 0$ is superadditive: an easy argument in the differentiable case is that
$$f(x+y) - f(x) - f(y) = \int_0^x \int_0^y f''(s+t)\, dt\, ds \ge 0.$$
By induction the general case (4) follows. Since $\lim_{x\to 0} f(x)/f'(x) = 0$, the same argument applies to $f/f'$ to give (5).

3.2. Lorentzian submanifold geometry. We now show that after a change of coordinates, $M$ can be seen as a codimension 1 submanifold of the $(n+1)$-dimensional flat Minkowski space $L^{n+1} = (\mathbb{R}^{n+1}, ds_L^2)$, where
$$ds_L^2 = dy_1^2 + \dots + dy_n^2 - dy_{n+1}^2. \tag{6}$$
In the sequel, we will denote by $\langle \cdot, \cdot \rangle$ the scalar product induced by this metric.

Proposition 2. The mapping
$$\Phi : M \to L^{n+1}, \quad (x_1, \dots, x_n) \mapsto \big(\eta(x_1), \dots, \eta(x_n), \eta(x_1 + \dots + x_n)\big),$$
where $\eta : \mathbb{R}_+^* \to \mathbb{R}$ is defined by
$$\eta(x) = \int_1^x \frac{dr}{\sqrt{f(r)}},$$
is an isometric embedding.

Proof. Since $f(x) \approx \tfrac{1}{2} f''(0)\, x^2$ for $x \approx 0$, we see that
$$\int_0^1 \frac{dr}{\sqrt{f(r)}} = \infty,$$
so that the image of $\eta$ must include all negative reals. Therefore $\eta$ maps $\mathbb{R}_+^*$ bijectively to $(-\infty, N)$ for some $N \in (0, \infty]$. The behavior of $f$ at infinity assumed in (3) implies that $f(x) \le Cx^2$ for all $x \ge K$, for some $K > 0$, $C > 0$, which in turn leads to
$$\int_1^\infty \frac{dx}{\sqrt{f(x)}} = \infty.$$
Therefore $\eta$ maps $\mathbb{R}_+^*$ bijectively to $\mathbb{R}$, and $\Phi$ is a homeomorphism onto its image. Since $\eta'(x) > 0$ for all $x$, it is also an immersion. Finally, if $(y_1, \dots, y_{n+1}) = \Phi(x_1, \dots, x_n)$,
$$dy_i^2 = \eta'(x_i)^2\, dx_i^2 = \frac{dx_i^2}{f(x_i)}, \quad i = 1, \dots, n, \qquad dy_{n+1}^2 = \frac{(dx_1 + \dots + dx_n)^2}{f(x_1 + \dots + x_n)},$$
and $\Phi$ is isometric.

Proposition 3.
$S = \Phi(M)$ is a codimension 1 submanifold of $L^{n+1}$ given by the graph of
$$y_{n+1} = \eta\big(\xi(y_1) + \dots + \xi(y_n)\big), \quad y_i > 0, \tag{7}$$
where $\xi = \eta^{-1}$. On this submanifold the metric is positive-definite and thus Riemannian. A basis of tangent vectors of $T_y S$ is defined by
$$e_i = \frac{\partial}{\partial y_i} + \sqrt{\frac{f(\xi(y_i))}{f(\xi(y_{n+1}))}}\, \frac{\partial}{\partial y_{n+1}}, \quad i = 1, \dots, n. \tag{8}$$

Proof. Let $\gamma(u) = (y_1(u), \dots, y_{n+1}(u))$ be a parametrized curve in $S$. Then its coordinates verify the following relations
$$y_{n+1} = \eta\big(\xi(y_1) + \dots + \xi(y_n)\big),$$
$$y_{n+1}' = \eta'\big(\xi(y_1) + \dots + \xi(y_n)\big)\,\big(\xi'(y_1)\, y_1' + \dots + \xi'(y_n)\, y_n'\big),$$
and so, since $\xi'(x) = \sqrt{f(\xi(x))}$,
$$\gamma'(u) = \sum_{i=1}^n y_i'(u)\, \frac{\partial}{\partial y_i} + \eta'\big(\xi(y_{n+1}(u))\big)\,\big(\xi'(y_1(u))\, y_1'(u) + \dots + \xi'(y_n(u))\, y_n'(u)\big)\, \frac{\partial}{\partial y_{n+1}}$$
$$= \sum_{i=1}^n y_i'(u) \left( \frac{\partial}{\partial y_i} + \sqrt{\frac{f(\xi(y_i(u)))}{f(\xi(y_{n+1}(u)))}}\, \frac{\partial}{\partial y_{n+1}} \right),$$
yielding (8) as basis tangent vectors. The metric components on $S$ take the form
$$g_{ij} = \langle e_i, e_j \rangle = \delta_{ij} - W_i W_j, \quad\text{or}\quad g = I - WW^T,$$
where $\langle \cdot, \cdot \rangle$ denotes the flat Minkowskian metric (6) and $W_i = \sqrt{f(\xi(y_i)) / f(\xi(y_{n+1}))}$ for $i = 1, \dots, n$. Applying Lemma 16 of the appendix gives the result upon computing
$$W^T W = \frac{\sum_{i=1}^n f(\xi(y_i))}{f(\xi(y_{n+1}))} < 1,$$
by superadditivity of $f$, as in (4).

In other words, the metric (2) is the restriction of the flat Lorentzian metric
$$ds^2 = \frac{dx_1^2}{f(x_1)} + \dots + \frac{dx_n^2}{f(x_n)} - \frac{dt^2}{f(t)}$$
to the hyperplane $t = x_1 + \dots + x_n$. In the sequel, we will use this Lorentzian submanifold geometry to study the sectional curvature and geodesic completeness of $M$. We state the results in the original coordinate system of $M$ when possible, using the following notations for any $y = (y_1, \dots, y_{n+1}) \in S$:
$$x_i = \xi(y_i), \quad i = 1, \dots, n, \qquad t = x_1 + \dots + x_n = \xi(y_{n+1}). \tag{9}$$

3.3. Negative sectional curvature. The goal of this section is to prove that the sectional curvature of $M$ is everywhere negative. We start by computing the shape operator.

Proposition 4.
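The shape operator $\Sigma$ of $S = \Phi(M)$ has the following components in the basis (8) of tangent vectors
$$\langle \Sigma(e_i), e_j \rangle = \frac{1}{2\sqrt{f(t) - \sum_{\ell=1}^n f(x_\ell)}} \left( f'(x_i)\,\delta_{ij} - \frac{f'(t)}{f(t)} \sqrt{f(x_i) f(x_j)} \right).$$

Proof. We first observe that the basis vectors (8) can be expressed in coordinates (9) as
$$e_i = \frac{\partial}{\partial y_i} + \sqrt{\frac{f(x_i)}{f(t)}}\, \frac{\partial}{\partial y_{n+1}}, \quad i = 1, \dots, n. \tag{10}$$
Since $S$ can be obtained as the graph of $F(y_1, \dots, y_n) = \eta(\xi(y_1) + \dots + \xi(y_n))$, a normal vector field to $S$ at $y$ is given by
$$N = \sum_{i=1}^n \frac{\partial F}{\partial y_i}\, \frac{\partial}{\partial y_i} + \frac{\partial}{\partial y_{n+1}} = \sum_{i=1}^n \sqrt{\frac{f(x_i)}{f(t)}}\, \frac{\partial}{\partial y_i} + \frac{\partial}{\partial y_{n+1}}, \tag{11}$$
which yields a timelike vector since
$$\langle N, N \rangle = \frac{1}{f(t)} \big( f(x_1) + \dots + f(x_n) - f(t) \big) < 0, \tag{12}$$
by superadditivity of $f$. Since $\langle N, e_i \rangle = 0$, the shape operator is then given by
$$\langle \Sigma(e_i), e_j \rangle = \Big\langle \bar\nabla_{e_i} \frac{N}{\sqrt{-\langle N, N \rangle}},\, e_j \Big\rangle = \frac{\langle \bar\nabla_{e_i} N, e_j \rangle}{\sqrt{-\langle N, N \rangle}}, \tag{13}$$
where $\bar\nabla$ is the flat connection of the Minkowski space. Denoting $\partial_i = \partial/\partial y_i$, we get from (10), (11) and the flatness of $\bar\nabla$,
$$\bar\nabla_{e_i} N = \bar\nabla_{\partial_i} N + \sqrt{\frac{f(x_i)}{f(t)}}\, \bar\nabla_{\partial_{n+1}} N = \sum_{j=1}^n \partial_i \partial_j F\, \partial_j.$$
Inserting this last equation along with (12) into (13) yields
$$\langle \Sigma(e_i), e_j \rangle = \sqrt{\frac{f(t)}{f(t) - \sum_{\ell=1}^n f(x_\ell)}}\; \partial_i \partial_j F.$$
Straightforward computations give
$$\partial_i F = \eta'(t)\, \xi'(y_i) = \sqrt{f(x_i)/f(t)},$$
$$\partial_i \partial_j F = \eta''(t)\, \xi'(y_i)\, \xi'(y_j) + \eta'(t)\, \xi''(y_i)\, \delta_{ij} = \frac{1}{2\sqrt{f(t)}} \left( -\frac{f'(t)}{f(t)} \sqrt{f(x_i) f(x_j)} + f'(x_i)\, \delta_{ij} \right),$$
and the result follows after simplification.

Corollary 5. The second fundamental form given by Proposition 4 is positive-definite.

Proof. This follows from Lemma 16 and the decomposition of the matrix $\Sigma$ with components $\Sigma_{ij} = \langle \Sigma(e_i), e_j \rangle$ as
$$\Sigma = k\,(D - cVV^T),$$
where $D = \mathrm{diag}(d_1, \dots, d_n)$ is a diagonal matrix, $V = (v_i)_{1 \le i \le n}$ is a column vector and $c$ and $k$ are constants, defined for $i = 1, \dots, n$ by
$$d_i = f'(x_i), \quad v_i = \sqrt{f(x_i)}, \quad k = \frac{1}{2\sqrt{f(t) - \sum_{\ell=1}^n f(x_\ell)}}, \quad c = \frac{f'(t)}{f(t)}. \tag{14}$$
Recalling that $f' > 0$ and $f > 0$ by Lemma 1, we see that the matrix $D$ and constant $c$ are positive.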
There remains to verify that
$$c\,V^T D^{-1} V = \frac{f'(t)}{f(t)} \sum_{i=1}^n \frac{f(x_i)}{f'(x_i)} < 1,$$
by the superadditivity property (5).

We can now show our main result.

Theorem 6. The sectional curvature of the Riemannian metric (2) is negative on $M$.

Proof. We use a result from O'Neill [22, Chapter 4, Corollary 20], which states that if the normal vector field $N$ of a hypersurface $M$ in a flat Lorentzian manifold $L$ is timelike, then the sectional curvature of the submanifold is given by
$$K(U, V) = -\frac{\langle \Sigma(U), U \rangle \langle \Sigma(V), V \rangle - \langle \Sigma(U), V \rangle^2}{\langle U, U \rangle \langle V, V \rangle - \langle U, V \rangle^2}, \tag{15}$$
where $U$ and $V$ are tangent to the submanifold and $\Sigma$ is the shape operator. The result now follows by the Cauchy-Schwarz inequality: since $\Sigma$ is a positive-definite symmetric matrix, we know that $\langle \Sigma(U), U \rangle \langle \Sigma(V), V \rangle \ge \langle \Sigma(U), V \rangle^2$ with equality if and only if $V$ is a multiple of $U$, but in that case the denominator vanishes as well. So the sectional curvature must be strictly negative.

We now give more specifically the formula of the sectional curvature of the planes generated by the basis tangent vectors (8).

Proposition 7. The sectional curvature along the axes defined by (8) is given by
$$K(e_i, e_j) = \frac{f(x_i) f'(x_j) f'(t) + f'(x_i) f(x_j) f'(t) - f'(x_i) f'(x_j) f(t)}{4\,\big(f(t) - \sum_{\ell=1}^n f(x_\ell)\big)\big(f(t) - f(x_i) - f(x_j)\big)}.$$

Proof. This follows from applying formula (15) for the sectional curvature of a hypersurface in a flat Lorentzian manifold, with
$$\langle e_i, e_i \rangle = 1 - \frac{f(x_i)}{f(t)}, \qquad \langle e_i, e_j \rangle = -\frac{\sqrt{f(x_i) f(x_j)}}{f(t)}, \quad i \ne j,$$
and $\langle \Sigma(e_i), e_j \rangle$ given by Proposition 4.

Finally, we state a result about the eigenvalues of the shape operator, which will be useful to show geodesic completeness in the next section.

Proposition 8. The principal curvatures at any point in $S = \Phi(M)$ are bounded.

Proof. The principal curvatures at a given point $(y_1, \dots, y_{n+1}) = \Phi(x_1, \dots, x_n) \in S$ are the eigenvalues of the shape operator
$$\Sigma = k\,(D - cVV^T),$$
where $D$, $V$, $c$ and $k$ are defined by (14).
Without loss of generality, we assume that the $n$-tuple $(x_1, \dots, x_n)$ is ordered. Let us first show that when at least $n-1$ variables go to zero, i.e. $x_i \to 0$ for $i = 1, \dots, n-1$ with the previous assumption, the principal curvatures go to zero. Let $\epsilon = x_1 + \dots + x_{n-1}$, then $t = x_n + \epsilon$ and
$$f(t) - f(x_1) - \dots - f(x_n) \underset{\epsilon\to 0}{\sim} \epsilon\, f'(x_n)$$
since $f$ has limit zero in zero, and so
$$k \underset{\epsilon\to 0}{\sim} \frac{1}{2\sqrt{\epsilon\, f'(x_n)}}.$$
Using the fact that when $\epsilon \to 0$, recalling assumptions (3),
$$f(x_i) = O(x_i^2), \quad f'(x_i) = O(x_i), \quad x_i = O(\epsilon), \quad i = 1, \dots, n-1,$$
we see that the diagonal terms of $D - cVV^T$ behave as
$$f'(x_i) - \frac{f'(t)}{f(t)}\, f(x_i) = O(\epsilon), \quad i = 1, \dots, n-1,$$
$$f'(x_n) - \frac{f'(t)}{f(t)}\, f(x_n) = \frac{f'(x_n) f(x_n + \epsilon) - f'(x_n + \epsilon) f(x_n)}{f(x_n + \epsilon)} \underset{\epsilon\to 0}{\sim} \epsilon \left( -f''(x_n) + \frac{f'(x_n)^2}{f(x_n)} \right),$$
while the off-diagonal terms verify
$$-\frac{f'(t)}{f(t)} \sqrt{f(x_i) f(x_j)} = O(\epsilon), \quad 1 \le i, j \le n-1,$$
$$-\frac{f'(t)}{f(t)} \sqrt{f(x_i) f(x_n)} = O(\epsilon).$$
Finally, we obtain that
$$\Sigma_{ij} = \langle \Sigma(e_i), e_j \rangle \underset{\epsilon\to 0}{=} O(\sqrt{\epsilon}), \quad 1 \le i, j \le n,$$
and so the principal curvatures go to zero when $\epsilon \to 0$. Therefore there exists $\delta > 0$ such that, at any point $(x_1, \dots, x_n)$ belonging to the set
$$D_\delta = \{ (x_1, \dots, x_n) \in (\mathbb{R}_+^*)^n,\ x_i < \delta \text{ for at least } n-1 \text{ indices } i \in \{1, \dots, n\} \},$$
the principal curvatures are upper bounded by, say, 1. Now let us consider an $n$-tuple $(x_1, \dots, x_n) \notin D_\delta$, ordered as before. Then the diagonal elements of $D$ are ordered as well since $f'$ is increasing, and the ordered eigenvalues of $k(D - cVV^T)$ verify
$$0 \le \lambda_1 \le \dots \le \lambda_n \le k\, d_n,$$
where the lower bound comes from the positive-definiteness shown in Corollary 5, and the upper bound comes from [15]. Since $d_n = f'(x_n)$ and $f'$ is increasing and upper bounded by $\lim_{x\to\infty} f'(x) = 1$, we have that $d_n \le 1$. Since the function
$$(x_1, \dots, x_n) \mapsto f(x_1 + \dots + x_n) - f(x_1) - \dots - f(x_n)$$
is increasing in all of its variables, it is larger than its limit as the first $n-2$ variables go to zero, and since at least $x_{n-1} > \delta$ and $x_n > \delta$, we obtain
$$k = \frac{1}{2\sqrt{f(t) - f(x_1) - \dots - f(x_n)}} \le \frac{1}{2\sqrt{f(2\delta) - 2f(\delta)}},$$
and the principal curvatures are again bounded.

3.4. Geodesics and geodesic completeness. The geodesics of $M$ for the metric (2) are parametrized curves $u \mapsto (x_1(u), \dots, x_n(u))$ solution of the standard second-order ODEs
$$\ddot{x}_k + \sum_{1 \le i, j \le n} \Gamma_{ij}^k\, \dot{x}_i \dot{x}_j = 0, \quad k = 1, \dots, n,$$
whose coefficients can be computed using the following result.

Proposition 9. The Christoffel symbols for metric (2) are given by
$$\Gamma_{ij}^k = \frac{1}{2} \left[ \frac{f(x_k)}{f(t) - \sum_{\ell=1}^n f(x_\ell)} \big( g(t) - g(x_j)\,\delta_{ij} \big) - g(x_k)\,\delta_{ij}\delta_{jk} \right],$$
where $t = x_1 + \dots + x_n$ and $g(x) = f'(x)/f(x)$, while $\delta_{ij}$ denotes the Kronecker delta function.

Proof. The Christoffel symbols of the second kind $\Gamma_{ij}^k$ can be obtained from the Christoffel symbols of the first kind $\Gamma_{ijk}$ and the coefficients $g^{ij}$ of the inverse of the metric matrix using the formula
$$\Gamma_{ij}^k = g^{kl}\, \Gamma_{ijl},$$
where we have used the Einstein summation convention. It is easy to see that the Christoffel symbols of the first kind are given by
$$\Gamma_{ijk} = \frac{1}{2} \left( \frac{f'(t)}{f(t)^2} - \frac{f'(x_i)}{f(x_i)^2}\, \delta_{ik}\delta_{jk} \right).$$
Applying the Sherman-Morrison formula, we obtain that the inverse of the metric matrix
$$g(x_1, \dots, x_n) = \mathrm{diag}\left( \frac{1}{f(x_1)}, \dots, \frac{1}{f(x_n)} \right) - \frac{1}{f(t)}\, J,$$
where $J$ denotes the $n$-by-$n$ matrix with all entries equal to one, is given by
$$g(x_1, \dots, x_n)^{-1} = \mathrm{diag}\big( f(x_1), \dots, f(x_n) \big) + \frac{1}{f(t) - \sum_{\ell=1}^n f(x_\ell)}\, \big[ f(x_i) f(x_j) \big]_{1 \le i, j \le n}. \tag{16}$$
Noticing that the sum of all the elements of the $k$th line (or column) of the inverse of the metric matrix is given by
$$\sum_{\ell=1}^n g^{k\ell} = f(x_k) + \frac{f(x_k) \sum_{\ell=1}^n f(x_\ell)}{f(t) - \sum_{\ell=1}^n f(x_\ell)} = \frac{f(x_k) f(t)}{f(t) - \sum_{\ell=1}^n f(x_\ell)}, \tag{17}$$
we obtain
$$\Gamma_{ij}^k = \sum_{\ell=1}^n g^{k\ell}\, \frac{1}{2} \left( \frac{f'(t)}{f(t)^2} - \frac{f'(x_\ell)}{f(x_\ell)^2}\, \delta_{j\ell}\delta_{ij} \right) = \frac{1}{2} \frac{f'(t)}{f(t)^2} \sum_{\ell=1}^n g^{k\ell} - \frac{1}{2} \frac{f'(x_j)}{f(x_j)^2}\, g^{kj}\, \delta_{ij}.$$
Inserting (17) and the general term of the inverse matrix (16) in the above yields
$$\Gamma_{ij}^k = \frac{1}{2} \frac{f'(t)}{f(t)^2}\, \frac{f(x_k) f(t)}{f(t) - \sum_{\ell=1}^n f(x_\ell)} - \frac{1}{2} \frac{f'(x_j)}{f(x_j)^2} \left( f(x_k)\, \delta_{kj} + \frac{f(x_k) f(x_j)}{f(t) - \sum_{\ell=1}^n f(x_\ell)} \right) \delta_{ij},$$
and the result follows.

Now, using the result of Proposition 8 and a theorem from [17], we can show that $M$ is geodesically complete.

Theorem 10. $M$ equipped with the Riemannian metric (2) is geodesically complete.

Proof. The image of $M$ by $\Phi$ is a hypersurface of the $(n+1)$-Minkowski space $L^{n+1}$. Moreover, $\Phi$ is an embedding and it is closed since $\Phi(M)$ is a closed subset of $L^{n+1}$ as preimage of the singleton $\{0\}$ by the continuous map $(y_1, \dots, y_{n+1}) \mapsto \xi(y_1) + \dots + \xi(y_n) - \xi(y_{n+1})$. Therefore $\Phi$ is proper [17, Theorem 1]. Then, [17, Theorem 6] allows us to conclude that since $\Phi$ has bounded principal curvatures by Proposition 8, $M$ equipped with the pullback (2) of the Minkowski metric by $\Phi$ is complete.

3.5. Uniqueness of the Fréchet mean. Since $M$ is simply connected, we deduce from Theorems 6 and 10 the following.

Corollary 11. $M$ equipped with the Riemannian metric (2) is a Hadamard manifold.

This has important implications in information geometry, as it guarantees the uniqueness of the Fréchet mean of a set of points in this geometry. The Fréchet mean, also called intrinsic mean, is a popular choice to extend the notion of barycenter to a Riemannian manifold. It is defined for a set of points $p_1, \dots, p_N \in M$ as the minimizer of the sum of the squared geodesic distances to the points of the set
$$\bar{p} = \operatorname*{argmin}_{p \in M} \sum_{i=1}^N d(p, p_i)^2.$$
It exists as long as $M$ is complete, however it is in general not unique and refers to a set. Uniqueness holds however for Hadamard manifolds [18].
This implies that the notion of barycenter of Dirichlet distributions is well defined in the Fisher-Rao geometry.

4. The two-dimensional case of beta distributions

The simplest case is obviously when $n = 2$, and even in this case the formulas are nontrivial. When $f = 1/\psi'$, the metric comes from the well-known two-parameter family of beta distributions defined on the compact interval $[0, 1]$, which is important in statistics and useful in many applications.

Proposition 12. The geodesic equations are given by
$$\begin{aligned} a(x,y)\,\ddot{x} + b(x,y)\,\dot{x}^2 + c(x,y)\,\dot{x}\dot{y} + d(x,y)\,\dot{y}^2 &= 0, \\ a(y,x)\,\ddot{y} + b(y,x)\,\dot{y}^2 + c(y,x)\,\dot{x}\dot{y} + d(y,x)\,\dot{x}^2 &= 0, \end{aligned} \tag{18}$$
where
$$\begin{aligned} a(x,y) &= 2\,\big( f(x+y) - f(x) - f(y) \big), \\ b(x,y) &= f(y)\,g(x) + f(x)\,g(x+y) - f(x+y)\,g(x), \\ c(x,y) &= 2 f(x)\,g(x+y), \\ d(x,y) &= f(x)\,g(x+y) - g(y)\,f(x), \end{aligned}$$
with the shorthand $g(x) = f'(x)/f(x)$.

Figure 2. On the left, geodesic balls and on the right, sectional curvature of the manifold of beta distributions ($n = 2$).

Figure 3. On the left, geodesic between the beta distributions of parameters (2, 5) and (2, 2) and on the right, Fréchet mean (full red line) compared to the Euclidean mean (dashed red line) of the beta distributions of parameters (2, 5), (2, 2) and (5, 1), shown in terms of probability density function.

Proof. The geodesic equations can be expressed in terms of the Christoffel symbols as
$$\begin{aligned} \ddot{x} + \Gamma_{11}^1\, \dot{x}^2 + 2\Gamma_{12}^1\, \dot{x}\dot{y} + \Gamma_{22}^1\, \dot{y}^2 &= 0, \\ \ddot{y} + \Gamma_{22}^2\, \dot{y}^2 + 2\Gamma_{12}^2\, \dot{x}\dot{y} + \Gamma_{11}^2\, \dot{x}^2 &= 0, \end{aligned}$$
and the coefficients can be computed using Proposition 9.

No closed form is known for the geodesics, but they can be computed numerically by solving (18), see the left-hand side of Figure 2. Nonetheless we can notice that, due to the symmetry of the metric with respect to parameters $x$ and $y$, both equations in (18) yield a unique ordinary differential equation when $x = y$.

Corollary 13.
Solutions of the geodesic equation (18) with $x(0) = y(0)$ and $\dot{x}(0) = \dot{y}(0)$ satisfy

$$q(x(t))\,\dot{x}(t)^2 = \text{constant}, \tag{19}$$

where $q(x) = \frac{1}{f(x)} - \frac{2}{f(2x)}$, and thus can be found by quadratures.

Proof. If at some time $t_0$ we have $x = y$ and $\dot{x} = \dot{y}$, then the equations (18) imply that $\ddot{x} = \ddot{y}$ at $t_0$. Differentiating repeatedly in time shows that all higher derivatives must also be equal at $t_0$, and we conclude by analyticity of the solutions that $x(t) = y(t)$ on some interval. The usual extension arguments for ODEs then imply that $x(t) = y(t)$ on the entire domain of the solution, which by Theorem 10 is $\mathbb{R}$. When $x = y$, equation (18) reduces to

$$2\big(f(2x) - 2f(x)\big)\,\ddot{x} + \left(\frac{4f(x)f'(2x)}{f(2x)} - \frac{f(2x)f'(x)}{f(x)}\right)\dot{x}^2 = 0,$$

which is equivalent to

$$2q(x)\,\ddot{x} + q'(x)\,\dot{x}^2 = 0.$$

This clearly implies the conservation law (19).

The differential equation (19) can then be solved by writing

$$t = \frac{1}{\dot{x}_0\sqrt{q(x_0)}}\int_{x_0}^{x}\sqrt{q(s)}\,ds$$

and inverting the resulting function. For example, if $f(x) = 1/\psi'(x)$, then the duplication formula for the trigamma function implies

$$q(x) = \psi'(x) - 2\psi'(2x) = \tfrac{1}{2}\left[\psi'(x) - \psi'\!\left(x + \tfrac{1}{2}\right)\right].$$

Asymptotically this looks like $q(x) \approx \frac{1}{2x^2}$ for $x \approx 0$ and $q(x) \approx \frac{1}{4x^2}$ as $x \to \infty$. We conclude that it takes infinite time for a geodesic along the diagonal to either reach "diagonal infinity" or the origin, as Theorem 10 of course implies.

From an applications point of view, the geodesics for the Fisher-Rao geometry allow us to define a notion of optimal interpolation between beta and more generally Dirichlet distributions. An example of such an optimal interpolation is shown on the left-hand side of Figure 3, in terms of probability density function.

Now we give the formula for the sectional curvature in two dimensions.

Proposition 14. If $n = 2$, the sectional curvature is given by

$$K(x,y) = -\frac{1}{4}\,\frac{f(t)f'(x)f'(y) - f(x)f'(t)f'(y) - f(y)f'(t)f'(x)}{\big(f(t) - f(x) - f(y)\big)^2}, \tag{20}$$

where $t = x + y$.

Proof. This is just a particular case of Proposition 7.
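As a numerical illustration of the two-dimensional formulas above, the following Python sketch integrates the geodesic equations (18) for $f = 1/\psi'$ with a classical fourth-order Runge-Kutta scheme, monitors the squared speed in the Fisher information metric of the beta family (which should stay constant along a geodesic), and evaluates the curvature (20). The helper names (`trigamma`, `tetragamma`, `accel`, `rk4_step`, `speed2`, `curvature`, `integrate`), the series-based approximations of $\psi'$ and $\psi''$, the step size and the initial conditions are all assumptions made for this illustration and are not taken from the paper.

```python
import math

def trigamma(x):
    """psi'(x), x > 0: shift x upward by recurrence, then use an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc += 1.0 / x**2
        x += 1.0
    return acc + (1/x + 1/(2*x**2) + 1/(6*x**3) - 1/(30*x**5)
                  + 1/(42*x**7) - 1/(30*x**9))

def tetragamma(x):
    """psi''(x), x > 0: term-by-term derivative of the recurrence and series above."""
    acc = 0.0
    while x < 10.0:
        acc -= 2.0 / x**3
        x += 1.0
    return acc + (-1/x**2 - 1/x**3 - 1/(2*x**4) + 1/(6*x**6)
                  - 1/(6*x**8) + 3/(10*x**10))

def f(x):    # f = 1/psi'
    return 1.0 / trigamma(x)

def fp(x):   # f' = -psi''/psi'^2
    return -tetragamma(x) / trigamma(x)**2

def g(x):    # shorthand g = f'/f used in (18)
    return fp(x) / f(x)

def accel(x, y, vx, vy):
    """x'' from the first equation of (18); accel(y, x, vy, vx) gives y''."""
    t = x + y
    a = 2.0 * (f(t) - f(x) - f(y))
    b = f(y)*g(x) + f(x)*g(t) - f(t)*g(x)
    c = 2.0 * f(x) * g(t)
    d = f(x)*g(t) - g(y)*f(x)
    return -(b*vx**2 + c*vx*vy + d*vy**2) / a

def rhs(s):
    x, y, vx, vy = s
    return (vx, vy, accel(x, y, vx, vy), accel(y, x, vy, vx))

def rk4_step(s, h):
    """One classical Runge-Kutta step for the first-order system s' = rhs(s)."""
    def move(k, w):
        return tuple(si + w*ki for si, ki in zip(s, k))
    k1 = rhs(s)
    k2 = rhs(move(k1, h/2))
    k3 = rhs(move(k2, h/2))
    k4 = rhs(move(k3, h))
    return tuple(si + h/6*(p + 2*q + 2*r + u)
                 for si, p, q, r, u in zip(s, k1, k2, k3, k4))

def speed2(s):
    """Squared speed in the Fisher information metric of the beta family."""
    x, y, vx, vy = s
    pt = trigamma(x + y)
    return (trigamma(x) - pt)*vx**2 - 2*pt*vx*vy + (trigamma(y) - pt)*vy**2

def curvature(x, y):
    """Sectional curvature (20) with t = x + y."""
    t = x + y
    num = f(t)*fp(x)*fp(y) - f(x)*fp(t)*fp(y) - f(y)*fp(t)*fp(x)
    return -num / (4.0 * (f(t) - f(x) - f(y))**2)

def integrate(s, h=0.005, steps=200):
    for _ in range(steps):
        s = rk4_step(s, h)
    return s

start = (2.0, 5.0, 0.5, -1.0)        # arbitrary point and velocity (assumed)
end = integrate(start)
drift = abs(speed2(end) - speed2(start)) / speed2(start)

diag_start = (1.0, 1.0, 0.4, 0.4)    # symmetric data as in Corollary 13
diag_end = integrate(diag_start)
diag_drift = abs(speed2(diag_end) - speed2(diag_start)) / speed2(diag_start)

print("relative drift of squared speed:", drift)
print("diagonal geodesic, x - y at final time:", diag_end[0] - diag_end[1])
print("K(1, 1) =", curvature(1.0, 1.0))
```

On the diagonal $x = y$ the squared speed equals $2q(x)\dot{x}^2$, so a small value of `diag_drift` is also a check of the conservation law (19).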
Notice that in two dimensions, the negativity of the sectional curvature is straightforward, as there is only one Gaussian curvature to consider, which is given by (20). One can easily see that the numerator there is positive by factorizing by $f'(x)f'(y)f'(t) > 0$ and using the superadditivity property (5) of $f/f'$.

As previously mentioned, the negative curvature of the Fisher-Rao geometry also has interesting implications for applications: it entails that the Fréchet mean of a set of beta, or more generally Dirichlet distributions is well defined. An example of Fréchet mean of beta distributions is shown in terms of probability density function on the right-hand side of Figure 3.

Figure 4. The difference (23) between the sectional curvatures of the plane generated by $e_1$ and $e_2$ in two and three dimensions changes sign for $z = 0.01$.

Numerically we observe that when $f = 1/\psi'$, the function $K(x, y)$ given by (20) is decreasing in both the $x$ and $y$ variables (see the right-hand side of Figure 2), but we do not yet have a proof of this fact. However we may analyze the asymptotics of the function relatively easily.

Proposition 15. If $f = 1/\psi'$, then the asymptotic behavior of the sectional curvature given by (20) approaching the boundary square is given by

$$\lim_{y \to 0} K(x,y) = \lim_{y \to 0} K(y,x) = \frac{3}{4} - \frac{\psi'(x)\,\psi'''(x)}{2\,\psi''(x)^2}, \tag{21}$$

$$\lim_{y \to \infty} K(x,y) = \lim_{y \to \infty} K(y,x) = \frac{x\,\psi''(x) + \psi'(x)}{4\big(x\,\psi'(x) - 1\big)^2}. \tag{22}$$

Moreover, we have the following limits at the asymptotic corners:

$$\lim_{x,y \to 0} K(x,y) = 0, \qquad \lim_{x,y \to \infty} K(x,y) = -\frac{1}{2}, \qquad \lim_{x \to 0,\, y \to \infty} K(x,y) = \lim_{x \to \infty,\, y \to 0} K(x,y) = -\frac{1}{4}.$$

Proof. Writing $K(x,y) = -\dfrac{A(x,y)}{4B(x,y)^2}$, with

$$\begin{aligned} A(x,y) &= f(x+y)f'(x)f'(y) - f(x)f'(x+y)f'(y) - f(y)f'(x+y)f'(x), \\ B(x,y) &= f(x+y) - f(x) - f(y), \end{aligned}$$

we note that $A(x,0) = A_y(x,0) = 0$ and $B(x,0) = 0$, so that

$$\lim_{y \to 0} K(x,y) = -\frac{A_{yy}(x,0)}{8\,B_y(x,0)^2},$$

which gives (21) after rewriting in terms of $\psi$.
For the infinite limits, we use the facts that $\lim_{y \to \infty} \big(f(y) - y\big) = -\frac{1}{2}$ and $\lim_{y \to \infty} f'(y) = 1$, and that $\lim_{y \to \infty} y\big(f'(y) - 1\big) = 0$, to obtain limits of $A(x,y)$ and $B(x,y)$ separately with elementary computations.

These limits and strong numerical evidence allow us to conjecture that the sectional curvature in two dimensions is lower bounded by $-1/2$. Comparing the two-dimensional sectional curvature $K_2(x,y) = K(x,y)$ with the sectional curvature of the plane generated by $e_1$ and $e_2$ in three dimensions, that we denote by $K_3(x,y,z)$, we observe numerically that for a given $z > 0$, the function

$$(x,y) \mapsto K_3(x,y,z) - K_2(x,y) \tag{23}$$

does not have a fixed sign in general, as can be observed in Figure 4 for small values of $x$, $y$ and $z$.

Acknowledgments

S. C. Preston was partially supported by Simons Foundation, Collaboration Grant for Mathematicians, no. 318969. A. Le Brigant and S. Puechmorel would like to thank Fabrice Gamboa and Thierry Klein for bringing this problem to their attention and for fruitful discussions.

Appendix

Here we give a well-known principle to establish positivity of matrices.

Lemma 16. Suppose $A$ is a positive-definite symmetric matrix, $V$ is a vector, and $c$ is a positive real number. Then $B = A - cVV^T$ is positive-definite if and only if

$$c\,V^T A^{-1} V < 1.$$

Proof. Since $A$ is positive-definite and symmetric, we may write $A = P^2$ for some positive-definite symmetric matrix $P$. Let $X = P^{-1}V$; then we may write

$$B = P^2 - cVV^T = P\big(I - c(P^{-1}V)(P^{-1}V)^T\big)P = P(I - cXX^T)P.$$

Denoting by $\langle \cdot | \cdot \rangle$ the usual scalar product on $\mathbb{R}^n$, with $\langle U | U \rangle = U^T U$, we have for any vector $U$,

$$\langle U | BU \rangle = \langle PU | PU \rangle - c\,\langle PU | X \rangle^2 = |Y|^2 - c\,\langle Y | X \rangle^2 \geq |Y|^2 - c\,|X|^2 |Y|^2 = |Y|^2\big(1 - c\,|X|^2\big),$$

where $Y = PU$, using the Cauchy-Schwarz inequality. This is positive for all $U$ if and only if the right side is positive for all $Y$, which translates into $c\,|X|^2 < 1$. Since $|X|^2 = \langle P^{-1}V | P^{-1}V \rangle = \langle V | A^{-1}V \rangle$, we obtain the claimed result.
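The criterion of Lemma 16 can be checked numerically on small examples. The sketch below, in plain Python, compares the test $c\,V^T A^{-1} V < 1$ with a direct positive-definiteness test of $B = A - cVV^T$ based on a Cholesky factorization. The function names (`cholesky`, `quad_form_inverse`, `rank_one_downdate`, `check`) and the example matrices are assumptions chosen for illustration only.

```python
import math

def cholesky(M):
    """Return lower-triangular L with M = L L^T, or None if M is not positive-definite."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return None          # a non-positive pivot rules out positive-definiteness
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def quad_form_inverse(A, V):
    """V^T A^{-1} V computed as |L^{-1} V|^2 by forward substitution (A = L L^T)."""
    L = cholesky(A)
    z = []
    for i in range(len(V)):
        z.append((V[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return sum(zi * zi for zi in z)

def rank_one_downdate(A, V, c):
    """B = A - c V V^T."""
    n = len(V)
    return [[A[i][j] - c * V[i] * V[j] for j in range(n)] for i in range(n)]

def check(A, V, c):
    """Return (criterion of Lemma 16, direct positive-definiteness test of B)."""
    criterion = c * quad_form_inverse(A, V) < 1.0
    direct = cholesky(rank_one_downdate(A, V, c)) is not None
    return criterion, direct

A2, V2 = [[2.0, 0.0], [0.0, 3.0]], [1.0, 1.0]          # V^T A^{-1} V = 5/6
A3, V3 = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]], [1.0, 2.0, 1.0]

for A, V, c in [(A2, V2, 1.0), (A2, V2, 1.5), (A3, V3, 0.1), (A3, V3, 10.0)]:
    print(c, check(A, V, c))
```

In each of these examples the two tests should agree, as the lemma predicts; values of $c$ very close to the threshold $1/(V^T A^{-1} V)$ are avoided because rounding errors could then decide the outcome.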
References

[1] Horst Alzer and Jim Wells. Inequalities for the polygamma functions. SIAM Journal on Mathematical Analysis, 29(6):1459–1466, 1998.
[2] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276.
[3] Shun-ichi Amari. Information geometry and its applications, volume 194. Springer, 2016.
[4] Jesus Angulo and Santiago Velasco-Forero. Morphological processing of univariate Gaussian distribution-valued images based on Poincaré upper-half plane representation. In Geometric Theory of Information, pages 331–366. Springer, 2014.
[5] Marc Arnaudon, Frédéric Barbaresco, and Le Yang. Riemannian medians and means with applications to radar signal processing. IEEE Journal of Selected Topics in Signal Processing, 7(4):595–604, 2013.
[6] Khadiga Arwini and Christopher TJ Dodson. Information geometry: near randomness and near independence. Springer Science & Business Media, 2008.
[7] Colin Atkinson and Ann FS Mitchell. Rao's distance measure. Sankhyā: The Indian Journal of Statistics, Series A, pages 345–365, 1981.
[8] Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer. Information geometry and sufficient statistics. Probability Theory and Related Fields, 162(1-2):327–364, 2015.
[9] Martin Bauer, Martins Bruveris, and Peter W Michor. Uniqueness of the Fisher–Rao metric on the space of smooth densities. Bulletin of the London Mathematical Society, 48(3):499–506, 2016.
[10] Nizar Bouguila, Djemel Ziou, and Jean Vaillancourt. Unsupervised learning of a finite mixture model based on the Dirichlet distribution and its application. IEEE Transactions on Image Processing, 13(11):1533–1543, 2004.
[11] Andrew H Briggs, AE Ades, and Martin J Price. Probabilistic sensitivity analysis for decision trees with multiple branches: use of the Dirichlet distribution in a Bayesian framework. Medical Decision Making, 23(4):341–350, 2003.
[12] Ovidiu Calin and Constantin Udriște. Geometric modeling in probability and statistics. Springer, 2014.
[13] Nikolai Nikolaevich Cencov. Statistical decision rules and optimal inference. Transl. Math. Monographs, American Mathematical Society, Providence, RI, 1982.
[14] Ronald A Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222(594-604):309–368, 1922.
[15] Gene H Golub. Some modified matrix eigenvalue problems. SIAM Review, 15(2):318–334, 1973.
[16] Tom Griffiths. Gibbs sampling in the generative model of latent Dirichlet allocation. 2002.
[17] Stephen G Harris. Closed and complete spacelike hypersurfaces in Minkowski space. Classical and Quantum Gravity, 5(1):111, 1988.
[18] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on pure and applied mathematics, 30(5):509–541, 1977.
[19] Stefan L Lauritzen. Statistical manifolds. Differential geometry in statistical inference, 10:163–216.
[20] Rasmus E Madsen, David Kauchak, and Charles Elkan. Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd international conference on Machine learning, pages 545–552.
[21] Yann Ollivier. True asymptotic natural gradient optimization. arXiv preprint arXiv:1712.08449, 2017.
[22] Barrett O'Neill. Semi-Riemannian geometry with applications to relativity. Academic Press, 1983.
[23] Philip D O'Neill and Gareth O Roberts. Bayesian inference for partially observed stochastic epidemics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 162(1):121–129, 1999.
[24] Adrian Peter and Anand Rangarajan. Shape analysis using the Fisher-Rao Riemannian metric: Unifying shape representation and deformation. In 3rd IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2006, pages 1164–1167. IEEE, 2006.
[25] M Petrovich. Sur une fonctionnelle. Publ. Math. Beograd, TL, 1932.
[26] C Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37, 1945.
[27] Sana Rebbah, Florence Nicol, and Stéphane Puechmorel. The geometry of the generalized gamma manifold and an application to medical imaging. Mathematics, 7(8):674, 2019.
[28] Salem Said, Lionel Bombrun, and Yannick Berthoumieu. Warped Riemannian metrics for location-scale models. In Geometric Structures of Information, pages 251–296. Springer, 2019.
[29] Olivier Schwander and Frank Nielsen. Model centroids for the simplification of kernel density estimators. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 737–740. IEEE, 2012.
[30] Lene Theil Skovgaard. A Riemannian geometry of the multivariate normal model. Scandinavian Journal of Statistics, pages 211–223, 1984.
[31] Anuj Srivastava, Ian Jermyn, and Shantanu Joshi. Riemannian analysis of probability density functions with applications in vision. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
[32] SY Trimble, Jim Wells, and FT Wright. Superadditive functions and a statistical application. SIAM journal on mathematical analysis, 20(5):1255–1259, 1989.
[33] Shengping Yang and Zhide Fang. Beta approximation of ratio distribution and its application to next generation sequencing read counts. Journal of applied statistics, 44(1):57–70, 2017.
[34] Zhen-Hang Yang. Some properties of the divided difference of psi and polygamma functions. Journal of Mathematical Analysis and Applications, 455(1):761–777, 2017.
[35] Zhenning Zhang, Huafei Sun, and Fengwei Zhong. Information geometry of the power inverse Gaussian distribution. Applied Sciences, 9, 2007.

SAMM 4543, Université Paris 1 Panthéon-Sorbonne, Centre PMF, Paris, France.
Email address: alice.le-brigant@univ-paris1.fr

Department of Mathematics, Brooklyn College and CUNY Graduate Center, New York, USA.
Email address: stephen.preston@brooklyn.cuny.edu

Ecole Nationale de l'Aviation Civile, Université de Toulouse, Toulouse, France.
Email address: stephane.puechmorel@enac.fr

Journal

Mathematics – arXiv (Cornell University)

Published: May 12, 2020
