A flexible multivariate model for high-dimensional correlated count data

aschissler@unr.edu
Department of Mathematics & Statistics, University of Nevada, Reno, NV 89557, USA

Abstract: We propose a flexible multivariate stochastic model for over-dispersed count data. Our methodology is built upon mixed Poisson random vectors $(Y_1, \dots, Y_d)$, where the $\{Y_i\}$ are conditionally independent Poisson random variables. The stochastic rates of the $\{Y_i\}$ are multivariate distributions with arbitrary non-negative margins linked by a copula function. We present basic properties of these mixed Poisson multivariate distributions and provide several examples. A particular case with geometric and negative binomial marginal distributions is studied in detail. We illustrate an application of our model by conducting a high-dimensional simulation motivated by RNA-sequencing data.

Keywords: Multivariate count data; Copula; Distribution theory; Big data applications; Gamma-Poisson hierarchy; Mixed Poisson distribution; Negative binomial distribution; High-dimensional multivariate simulation; RNA-sequencing data

AMS Subject Classification: 62E10; 62E15; 62H05; 62H10; 62H30

1 Introduction

As multivariate count data become increasingly common across many scientific disciplines, including economics, finance, geosciences, biology, marketing, and others, there is a growing need for flexible families of multivariate distributions with discrete margins. In particular, flexible models with correlated classical marginal distributions are in high demand in many different applied areas (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Madsen and Dalthorp (2007); Nikoloulopoulos and Karlis (2009); Xiao (2017)). With this in mind, we propose a general method of constructing discrete multivariate distributions with certain common marginal distributions.
One important example of this construction is a discrete multivariate model with correlated negative binomial (NB) components and arbitrary parameters. However, our approach is quite general and can produce families with different margins, going beyond the NB case. One way to generate multivariate distributions with particular margins is an approach through copulas (see, e.g., Nelsen (2006)), and multivariate discrete distributions constructed through this method have been proposed in recent years (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Nikoloulopoulos (2013); Nikoloulopoulos and Karlis (2009); Xiao (2017) and references therein).

© The Author(s) 2021. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). Knudson et al., Journal of Statistical Distributions and Applications (2021) 8:6.

Recall that a copula is a cumulative distribution function (CDF) on $[0,1]^d$ describing a random vector with standard uniform margins. Moreover, for any random vector $X = (X_1, \dots, X_d)$ with joint CDF $F$ and marginal CDFs $F_i$, there is a copula function $C(u_1, \dots, u_d)$ so that

$$F(x_1, \dots, x_d) = P(X_1 \le x_1, \dots, X_d \le x_d) = C(F_1(x_1), \dots, F_d(x_d)), \quad x_i \in \mathbb{R},\ i = 1, \dots, d. \quad (1)$$

Further, for continuous distributions with marginal probability density functions (PDFs) $f_i(x) = F_i'(x)$, the copula function $C$ is unique, and the joint PDF of the $\{X_i\}$ is given by

$$f(x_1, \dots, x_d) = \left\{ \prod_{i=1}^d f_i(x_i) \right\} c(F_1(x_1), \dots, F_d(x_d)), \quad x_i \in \mathbb{R},\ i = 1, \dots, d, \quad (2)$$

where the function $c(u_1, \dots, u_d)$ is the PDF corresponding to the copula $C(u_1, \dots, u_d)$. However, for discrete distributions, the copula is no longer unique and there is no analogue of (2) for calculating the relevant probabilities. Using this concept, one can define a random vector $Y = (Y_1, \dots, Y_d)$ in $\mathbb{R}^d$ with arbitrary marginal CDFs $F_i$ viz.

$$(Y_1, \dots, Y_d) = \left( F_1^{-1}(U_1), \dots, F_d^{-1}(U_d) \right), \quad (3)$$

where $U = (U_1, \dots, U_d)$ is a random vector with standard uniform margins and CDF given by

$$F_U(u_1, \dots, u_d) = P(U_1 \le u_1, \dots, U_d \le u_d) = C(u_1, \dots, u_d), \quad (u_1, \dots, u_d) \in [0,1]^d, \quad (4)$$

with a particular copula $C$. While one can use any of the multitude of different copula functions in this construction, an approach based on the Gaussian copula, known as NORTA (NORmal To Anything; see, e.g., Chen (2001); Song and Hsiao (1993)), is especially popular due to its flexibility, particularly in the case of discrete distributions (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Nikoloulopoulos (2013)).

While our approach involves copulas as well, they connect with continuous multivariate distributions rather than discrete ones, which avoids the issues with non-uniqueness of the copula function. Additionally, compared with the direct approach (3), in our scheme the computation of relevant probabilities is straightforward. Our methodology is based on mixtures of Poisson distributions, which is a common way of obtaining discrete analogs of continuous distributions on the nonnegative reals with a particular stochastic interpretation.
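As a concrete illustration of the direct construction (3)-(4), the sketch below samples from a Gaussian copula and pushes the resulting uniforms through a NB inverse CDF. This is a hedged example, not code from the paper: the dimension, the equicorrelation structure, and the NB parameters are arbitrary illustrative choices, and NumPy is assumed to be available.

```python
import numpy as np
from math import erf

def nb_ppf(u, r, p):
    """Smallest y with NB(r, p) CDF >= u (the generalized inverse in (3))."""
    y = 0
    pmf = p ** r                 # P(Y = 0)
    cdf = pmf
    while cdf < u:
        pmf *= (y + r) * (1 - p) / (y + 1)   # NB PMF recursion
        y += 1
        cdf += pmf
    return y

rng = np.random.default_rng(3)
d, n, rho = 3, 20_000, 0.7
r, p = 2.0, 0.4                               # common NB(r, p) margins
R = rho * np.ones((d, d)) + (1 - rho) * np.eye(d)

Z = rng.multivariate_normal(np.zeros(d), R, size=n)   # Gaussian draw
U = 0.5 * (1 + np.vectorize(erf)(Z / np.sqrt(2)))     # to uniforms, as in (4)
Y = np.vectorize(nb_ppf)(U, r, p)                     # inverse-CDF step (3)

print(Y.mean(axis=0), r * (1 - p) / p)        # empirical vs. target NB mean
print(np.corrcoef(Y, rowvar=False)[0, 1])     # induced pairwise correlation
```

The empirical means track the NB target $r(1-p)/p$, while the induced Pearson correlation typically falls somewhat below the copula parameter $\rho$, as expected for discrete margins.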
Indeed, discrete univariate mixed Poisson distributions have proven to be useful stochastic models in many scientific fields (see, e.g., Karlis and Xekalaki (2005), where one can find a comprehensive review of these distributions with over 30 particular examples). This construction can be described through a randomly stopped Poisson process. More precisely, let $\{N(t), t \in \mathbb{R}_+\}$ be a homogeneous Poisson process with rate $\lambda > 0$, so that the marginal distribution of $N(t)$ is Poisson with parameter (mean) $\lambda t$. Then, for any random variable $T$ with cumulative distribution function (CDF) $F_T$ supported on $\mathbb{R}_+$, the quantity $Y = N(T)$ is an integer-valued random variable, with distribution determined viz. a standard conditioning argument as follows:

$$P(Y = n) = \int_0^\infty \frac{e^{-\lambda t} (\lambda t)^n}{n!} \, dF_T(t), \quad n \in \mathbb{N}_0 = \{0, 1, \dots\}. \quad (5)$$

Many standard probability distributions on $\mathbb{N}_0$ arise from this scheme. In particular, if $T$ has a standard gamma distribution with shape parameter $r > 0$, given by the PDF

$$f(x) = \frac{1}{\Gamma(r)} x^{r-1} e^{-x}, \quad x \in \mathbb{R}_+, \quad (6)$$

then $Y = N(T)$ will have a NB distribution $NB(r, p)$ with the probability mass function (PMF)

$$P(Y = n) = \frac{\Gamma(n + r)}{\Gamma(r)\, n!} \, p^r (1 - p)^n, \quad n \in \mathbb{N}_0, \quad (7)$$

where $p = 1/(1 + \lambda)$ (see Section 3.2 in the Appendix). As the NB model is quite important across many applications and can be extended to more general stochastic processes (see, e.g., Kozubowski and Podgórski (2009)), it shall serve as a basic example of our approach.

An extension of this scheme to the multivariate case can be accomplished in two different ways, leading to mixed multivariate Poisson distributions of Kind (or Type) I and II in the terminology of Karlis and Xekalaki (2005). The former arises viz.

$$Y = (Y_1, \dots, Y_d) = (N_1(T), \dots, N_d(T)), \quad (8)$$

where the $\{N_i(\cdot)\}$ are Poisson processes with rates $\lambda_i$ and $T$ is, as before, a random variable on $\mathbb{R}_+$, independent of the $\{N_i\}$. While in general the marginal distributions of $(N_1(t), \dots, N_d(t))$ can be correlated multivariate Poisson (see Johnson et al. (1997)), we shall assume that the processes $\{N_i\}$ are mutually independent. In this case, the joint probability generating function (PGF) of the $\{Y_i\}$ in (8) is of the form

$$G(s_1, \dots, s_d) = E\left\{ \prod_{i=1}^d s_i^{Y_i} \right\} = \phi_T\left( \sum_{i=1}^d \lambda_i - \sum_{i=1}^d \lambda_i s_i \right), \quad (s_1, \dots, s_d) \in [0,1]^d, \quad (9)$$

where $\phi_T$ is the Laplace transform (LT) of $T$, while the relevant probabilities can be conveniently expressed as

$$P(Y = y) = \left\{ \prod_{i=1}^d g_i(y_i) \right\} h(y), \quad y = (y_1, \dots, y_d) \in \mathbb{N}_0^d, \quad (10)$$

where the $\{g_i\}$ in (10) are the marginal PMFs of the $\{Y_i\}$. As shown in the Appendix, the function $h$ is of the form

$$h(y) = \frac{v_T\left( \sum_{i=1}^d y_i,\ \sum_{i=1}^d \lambda_i \right)}{\prod_{i=1}^d v_T(y_i, \lambda_i)}, \quad y = (y_1, \dots, y_d) \in \mathbb{N}_0^d, \quad (11)$$

where

$$v_T(y, \lambda) = E\left( e^{-\lambda T} T^y \right), \quad \lambda, y \in \mathbb{R}_+. \quad (12)$$

In the case of gamma distributed $T$, with shape parameter $r > 0$ and unit scale, the functions $v_T$ and $h$ above can be evaluated explicitly (see the Appendix for details), and the above distribution is known in the literature as the negative multinomial distribution (see Chapter 36 of Johnson et al. (1997) and references therein). Since the marginal distributions in this case are NB, the distribution has also been termed multivariate NB. In cases where the function $v_T(\cdot, \cdot)$ is not available explicitly, it can be easily evaluated numerically, viz. Monte Carlo simulation.

Our main focus will be a more flexible family of mixed Poisson distributions of Kind II, where each Poisson process $\{N_i(t), t \in \mathbb{R}_+\}$ is randomly stopped at a different random variable $T_i$, leading to

$$Y = (Y_1, \dots, Y_d) = (N_1(T_1), \dots, N_d(T_d)), \quad (13)$$

where the random vector $T = (T_1, \dots, T_d)$ follows a multivariate distribution on $\mathbb{R}_+^d$.
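The univariate building block (5)-(7) is easy to verify numerically. The following sketch (illustrative values of $r$ and $\lambda$; NumPy assumed) draws $Y = N(T)$ by conditioning on $T \sim \mathrm{Gamma}(r, 1)$ and compares empirical frequencies with the NB PMF (7):

```python
import numpy as np
from math import lgamma, log, exp

rng = np.random.default_rng(0)
r, lam = 2.5, 3.0              # illustrative shape and Poisson rate
p = 1.0 / (1.0 + lam)          # NB probability, as below (7)

# Gamma-Poisson hierarchy (5): stop a rate-lam Poisson process at T ~ Gamma(r, 1)
T = rng.gamma(shape=r, scale=1.0, size=200_000)
Y = rng.poisson(lam * T)

def nb_pmf(n, r, p):
    """PMF (7): Gamma(n + r) / (Gamma(r) n!) * p^r * (1 - p)^n."""
    return exp(lgamma(n + r) - lgamma(r) - lgamma(n + 1)
               + r * log(p) + n * log(1 - p))

emp = [(Y == n).mean() for n in range(5)]      # empirical frequencies
thy = [nb_pmf(n, r, p) for n in range(5)]      # NB(r, p) probabilities
print(np.round(emp, 4), np.round(thy, 4))
```

The sample also exhibits the over-dispersion noted throughout: the empirical variance of $Y$ exceeds its mean, consistent with $\mathrm{Var}\,Y = r\lambda + r\lambda^2 > r\lambda = EY$.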
A particular special case of this construction, with the $\{T_i\}$ having correlated log-normal distributions, was proposed in Madsen and Dalthorp (2007), where the model was referred to as a lognormal-Poisson hierarchy (L-P model). While that particular model does not allow explicit forms for the marginal PMFs, it proved useful for applications. Our generalization, which we shall refer to as a T-Poisson hierarchy, will allow $T$ in (13) to have any continuous distribution on $\mathbb{R}_+^d$, with margins not necessarily belonging to the same parametric family. As will be seen in the sequel, the joint PMF of this more general model can still be written as in (10), with an appropriate function $h$. In particular, we shall work with families of distributions of $T$ described by marginal CDFs $\{F_i\}$ and a copula function $C(u_1, \dots, u_d)$. In this set-up, the PMF of $Y$, which is still of the form (10), can be expressed as

$$P(Y = y) = \left\{ \prod_{i=1}^d g_i(y_i) \right\} E\left\{ c(F_1(X_1), \dots, F_d(X_d)) \right\}, \quad y \in \mathbb{N}_0^d, \quad (14)$$

where the $g_i$ are the marginal PMFs of the $\{Y_i\}$, the function $c(u_1, \dots, u_d)$ is the PDF corresponding to the copula $C(u_1, \dots, u_d)$, and the $\{X_i\}$ are independent random variables with certain distributions dependent on the $\{y_i\}$. This expression, which is an analogue of (2) for discrete multivariate distributions defined through our scheme, provides a convenient way of computing the probabilities of these multivariate distributions. This computational aspect of our construction compares favorably with the cumbersome formula for the PMF (see, e.g., Proposition 1.1 in Nikoloulopoulos and Karlis (2009)) of the competing method defined viz. (3).

In what follows, we explore these ideas to provide a flexible multivariate modeling framework for dependent count data, emphasizing computationally convenient expressions and scalable algorithms for high-dimensional applications.
We begin by showing how multivariate count data can be generated as mixtures of Poisson distributions by developing sequences of independent Poisson processes randomly stopped at an underlying continuous real-valued random variable $T$ (a T-Poisson hierarchy). Then we show how our T-Poisson hierarchy scheme gives rise to computationally convenient joint probability mass functions (PMFs) and how particular choices of parameters/distributions can be used to construct well-known models such as the multivariate negative binomial. Next, we describe a scalable simulation algorithm using our construction and copula theory. Two examples are provided: a basic example producing a multivariate geometric distribution, and an elaborate high-dimensional simulation study aiming to model and simulate RNA-sequencing data. We note that our modeling framework and computationally convenient formulas may facilitate novel data analysis strategies, but we do not take up that task in the current study. We conclude with an Appendix containing selected proofs of assertions made throughout.

2 Multivariate mixtures of Poisson distributions

Our goal is to produce a random vector $Y = (Y_1, \dots, Y_d)$ with correlated mixed Poisson components. To this end, we start with a sequence of independent Poisson processes $\{N_i(t), t \in \mathbb{R}_+\}$, $i = 1, \dots, d$, where the rate of the process $N_i(t)$ is $\lambda_i$. Next, we let $T = (T_1, \dots, T_d)$ have a multivariate distribution on $\mathbb{R}_+^d$ with PDF $f_T(t)$. Then, we define

$$Y = (Y_1, \dots, Y_d) = (N_1(T_1), \dots, N_d(T_d)). \quad (15)$$

In the terminology of Karlis and Xekalaki (2005), this is a special case of multivariate mixed Poisson distributions of Type II. Assuming that the $\{N_i(t)\}$ are independent of $T$, by standard conditioning arguments (see Lemma 7 in the Appendix) we obtain

$$P(Y = y) = \left\{ \prod_{i=1}^d \frac{\lambda_i^{y_i}}{y_i!} \right\} \int_{\mathbb{R}_+^d} e^{-\sum_{i=1}^d \lambda_i t_i} \left\{ \prod_{i=1}^d t_i^{y_i} \right\} f_T(t)\, dt. \quad (16)$$

While in some cases one can obtain explicit expressions for the above joint probabilities, in general these have to be calculated numerically. The calculations can be facilitated by certain representations of these probabilities, discussed in the Appendix (see Lemmas 7 and 8).

This procedure is quite general, and leads to a multitude of multivariate discrete distributions. Flexible models allowing for marginal distributions of different types can be obtained by the popular approach with copulas. Assume that $T$ has a continuous distribution on $\mathbb{R}_+^d$ with marginal PDFs $f_i$ and CDFs $F_i$, driven by a particular copula $C(u_1, \dots, u_d)$, so that the joint CDF of the $\{T_i\}$ is given by

$$F_T(t) = P(T_1 \le t_1, \dots, T_d \le t_d) = C(F_1(t_1), \dots, F_d(t_d)), \quad t = (t_1, \dots, t_d) \in \mathbb{R}_+^d.$$

Then according to (2), the joint PDF $f_T$ is of the form

$$f_T(t) = \left\{ \prod_{i=1}^d f_i(t_i) \right\} c(F_1(t_1), \dots, F_d(t_d)), \quad t = (t_1, \dots, t_d) \in \mathbb{R}_+^d, \quad (17)$$

where the function $c(u_1, \dots, u_d)$ is the PDF corresponding to the copula CDF $C(u_1, \dots, u_d)$. When we substitute (17) into (16), we get

$$P(Y = y) = \left\{ \prod_{i=1}^d \frac{\lambda_i^{y_i}}{y_i!} \right\} \int_{\mathbb{R}_+^d} e^{-\sum_{i=1}^d \lambda_i t_i} \left\{ \prod_{i=1}^d t_i^{y_i} f_i(t_i) \right\} c(F_1(t_1), \dots, F_d(t_d))\, dt. \quad (18)$$

Using the results presented in the Appendix (see Lemma 7), one can show that the marginal PMFs of the $\{Y_i\}$ are given by

$$P(Y_i = y) = \frac{\lambda_i^y}{y!} E\left( e^{-\lambda_i T_i} T_i^y \right) = E\left\{ f_{\lambda_i T_i}(W) \right\}, \quad (19)$$

where $f_{\lambda_i T_i}(\cdot)$ is the PDF of $\lambda_i T_i$ and $W$ has a standard gamma distribution with shape parameter $y + 1$. With this notation, we can write (18) in the form

$$P(Y = y) = \left\{ \prod_{i=1}^d P(Y_i = y_i) \right\} \int_{\mathbb{R}_+^d} c(F_1(t_1), \dots, F_d(t_d))\, g(t|y)\, dt, \quad (20)$$

where the quantity $g(t|y)$ in the above integral is the joint PDF of a multivariate distribution with independent margins,

$$g(t|y) = \prod_{i=1}^d g_i(t_i|y_i), \quad (21)$$

with

$$g_i(t|y) = \frac{t^y e^{-\lambda_i t} f_i(t)}{E\left( e^{-\lambda_i T_i} T_i^y \right)}, \quad t \in \mathbb{R}_+. \quad (22)$$
Thus, the integral in (20) can be expressed as

$$\int_{\mathbb{R}_+^d} c(F_1(t_1), \dots, F_d(t_d))\, g(t|y)\, dt = E\left\{ c(F_1(X_1), \dots, F_d(X_d)) \right\}, \quad (23)$$

where $X = (X_1, \dots, X_d)$ has a multivariate distribution with independent components, governed by the PDF specified by (21)-(22). This leads to the following result.

Proposition 1 In the above setting, the joint probabilities (18) admit the representation

$$P(Y = y) = \left\{ \prod_{i=1}^d P(Y_i = y_i) \right\} E\left\{ c(F_1(X_1), \dots, F_d(X_d)) \right\}, \quad y = (y_1, \dots, y_d) \in \mathbb{N}_0^d, \quad (24)$$

where the marginal probabilities are given by (19) and the PDF of $X = (X_1, \dots, X_d)$ is given by (21)-(22).

Let us note that the joint moments of $Y_1, \dots, Y_d$ exist whenever their counterparts for $T_1, \dots, T_d$ are finite, in which case they can be evaluated by standard conditioning arguments. In particular, the mean and the covariance matrix of $Y$ are related to their counterparts connected with $T$ in a simple way, specified by Lemma 9 in the Appendix. It follows that $EY_i = \lambda_i ET_i$ and $\mathrm{Var}\,Y_i = \lambda_i ET_i + \lambda_i^2 \mathrm{Var}\,T_i$, so the distributions of the $\{Y_i\}$ are always over-dispersed. Moreover, we have

$$\mathrm{Cov}(Y_i, Y_j) = \lambda_i \lambda_j\, \mathrm{Cov}(T_i, T_j), \quad i \ne j,$$

so that the correlation coefficient of $Y_i$ and $Y_j$ (if it exists) is related to that of $T_i$ and $T_j$ as follows:

$$\rho_{Y_i, Y_j} = c_{i,j}\, \rho_{T_i, T_j}, \quad i \ne j, \quad (25)$$

where

$$c_{i,j} = \sqrt{\frac{\lambda_i}{\lambda_i + \frac{E(T_i)}{\mathrm{Var}(T_i)}}}\ \sqrt{\frac{\lambda_j}{\lambda_j + \frac{E(T_j)}{\mathrm{Var}(T_j)}}}, \quad i \ne j. \quad (26)$$

Remark 1 While in general the correlation can be positive as well as negative and admits the same range as its counterpart for $T_i$ and $T_j$, the range of possible correlations of $Y_i$ and $Y_j$ can be further restricted if the margins are fixed. The maximum and minimum correlations can be deduced from (25)-(26) and the range of correlation corresponding to the joint distribution of $T_i$ and $T_j$.
The latter is provided by the minimum and maximum correlations, corresponding to the lower and upper Fréchet copulas,

$$C_L(u_1, u_2) = \max\{u_1 + u_2 - 1,\, 0\}, \qquad C_U(u_1, u_2) = \min\{u_1, u_2\}, \qquad u_1, u_2 \in [0,1]. \quad (27)$$

The upper bound for the correlation is obtained when the distribution of $(T_i, T_j)$ is driven by the upper Fréchet copula $C_U$ in (27), so that $T_i \stackrel{d}{=} F_i^{-1}(U)$ and $T_j \stackrel{d}{=} F_j^{-1}(U)$, where $U$ is standard uniform and $F_i(\cdot), F_j(\cdot)$ are the CDFs of $T_i, T_j$, respectively. Similarly, the lower bound for the correlation is obtained when the distribution of $(T_i, T_j)$ is driven by the lower Fréchet copula $C_L$ in (27), where we have $T_i \stackrel{d}{=} F_i^{-1}(U)$ and $T_j \stackrel{d}{=} F_j^{-1}(1 - U)$. While these correlation bounds are usually not available explicitly, they can be easily obtained by Monte Carlo approximations viz. simulation from these (degenerate) probability distributions, or by other standard approximate methods (see, e.g., Demitras and Hedeker (2011), and references therein).

Remark 2 We note that when a bivariate random vector $Y = (Y_1, Y_2)$ is defined viz. (15) and the distribution of the corresponding $T = (T_1, T_2)$ is driven by one of the copulas in (27), then the distribution of $T$ is not absolutely continuous and the above derivations leading to the PDF of $Y$ need a modification. It can be shown that in this case the marginal distributions of the $Y_i$ are still given by (19), while the joint PMF of $(Y_1, Y_2)$ is also as in (20) with $d = 2$, but with the integral term replaced with

$$\int_0^1 g_1(u|y_1)\, g_2(u|y_2)\, du \qquad \text{and} \qquad \int_0^1 g_1(u|y_1)\, g_2(1 - u|y_2)\, du \quad (28)$$

under the upper and lower Fréchet copula cases, respectively, where the $g_i(\cdot|y)$ in (28) are PDFs on $(0,1)$ given by

$$g_i(u|y) = \frac{\left[ F_i^{-1}(u) \right]^y e^{-\lambda_i F_i^{-1}(u)}}{E\left( e^{-\lambda_i T_i} T_i^y \right)}, \quad u \in (0,1),\ y \in \mathbb{N}_0,\ i = 1, 2. \quad (29)$$

Again, while the integrals in (28) are rarely available explicitly, they can be easily approximated by Monte Carlo simulations in order to compute the joint PMF of $Y = (Y_1, Y_2)$. These two "extreme" distributional cases can also be used to derive the full range of values for the correlation of $Y = (Y_1, Y_2)$ when the marginal distributions (19) are fixed, if needed.

2.1 Mixed Poisson distributions with NB margins

We now consider the case where the mixed Poisson marginal distributions of $Y$ are NB, so that the marginal distributions of $T$ are gamma (see Lemma 1 in the Appendix). Thus, we shall assume that the coordinates of the random vector $T$ have univariate standard gamma distributions with shape parameters $r_i \in \mathbb{R}_+$, $i = 1, \dots, d$. There have been numerous multivariate gamma distributions developed over the years, and we could use any of them here. However, we follow the general approach based on copulas discussed above. Thus, we assume that the dependence structure of $T$ is governed by some copula function $C(u_1, \dots, u_d)$ which admits the PDF $c(u_1, \dots, u_d)$. In this case, the $f_i$ in (18) are given by (6) with $r = r_i$, and the $F_i$ are the corresponding CDFs. Here, the marginal PMFs of the $\{Y_i\}$ in (19) are given by

$$P(Y_i = y) = \frac{\Gamma(y + r_i)}{\Gamma(r_i)\, y!}\, p_i^{r_i} (1 - p_i)^y, \quad y \in \mathbb{N}_0, \quad (30)$$

where the NB probabilities are given by $p_i = 1/(1 + \lambda_i) \in (0,1)$ (so that $\lambda_i = (1 - p_i)/p_i > 0$). Further, the PDF of $X$ in Proposition 1 is still given by (21), where the marginal PDFs $g_i(\cdot|y_i)$ now admit the explicit expressions

$$g_i(t|y_i) = \frac{(1 + \lambda_i)^{y_i + r_i}}{\Gamma(y_i + r_i)}\, t^{y_i + r_i - 1} e^{-(1 + \lambda_i) t}, \quad t \in \mathbb{R}_+. \quad (31)$$

We recognize these as gamma PDFs. Thus, in this special case of multivariate mixed Poisson distributions of Type II with NB marginal distributions, the random vector $X$ in the representation (14) has a multivariate gamma distribution as well, but with independent margins. This fact is summarized in the result below.
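The expectation in (14) is straightforward to approximate by simulating the independent gamma variables (31). The sketch below is a hedged illustration with arbitrary parameters, not code from the paper: it takes the geometric case $r_i = 1$ (standard exponential margins for $T$) and, for concreteness, a Farlie-Gumbel-Morgenstern copula, whose density is simply $c(u_1, u_2) = 1 + \theta(1 - 2u_1)(1 - 2u_2)$ and for which the expectation also integrates in closed form, providing a check; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.8                      # FGM dependence parameter (illustrative)
p1, p2 = 0.3, 0.5                # geometric margins: r_i = 1
lam1, lam2 = (1 - p1) / p1, (1 - p2) / p2
y1, y2 = 2, 1                    # point at which the joint PMF is evaluated

# Independent gamma variates from (31) with r_i = 1:
# X_i ~ Gamma(shape = y_i + 1, rate = 1 + lam_i)
n = 200_000
X1 = rng.gamma(y1 + 1, scale=1.0 / (1.0 + lam1), size=n)
X2 = rng.gamma(y2 + 1, scale=1.0 / (1.0 + lam2), size=n)

F1, F2 = 1 - np.exp(-X1), 1 - np.exp(-X2)     # standard exponential CDFs
c = 1 + theta * (1 - 2 * F1) * (1 - 2 * F2)   # FGM copula density
mc = c.mean()                                 # Monte Carlo estimate of E{c(...)}

# Closed form of the same expectation (each factor integrates analytically)
exact = (1 + theta * (1 - 2 / (1 + p1) ** (y1 + 1))
                   * (1 - 2 / (1 + p2) ** (y2 + 1)))

# Joint PMF (14): product of geometric margins times the expectation
pmf = p1 * (1 - p1) ** y1 * p2 * (1 - p2) ** y2 * mc
print(mc, exact, pmf)
```

The Monte Carlo estimate agrees with the closed form to within sampling error, illustrating how (14) yields joint probabilities with nothing more than independent gamma draws.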
Corollary 1 Let $Y$ have a mixed Poisson distribution defined viz. (15), where the $\{N_i(\cdot)\}$ are independent Poisson processes with respective rates $\lambda_i$ and $T$ has a multivariate gamma distribution with standard gamma margins with shape parameters $r_i$ and CDFs $F_i$, governed by a copula PDF $c(u)$. Then, the marginal PMFs of $Y$ are given by (30) with $p_i = 1/(1 + \lambda_i) \in (0,1)$, and its joint PMF is given by (14), where $X = (X_1, \dots, X_d)$ has a multivariate gamma distribution with independent gamma marginal distributions of the $\{X_i\}$ with PDFs given by (31).

Remark 3 If the expectation in (14) does not admit an explicit form in terms of $y_1, \dots, y_d$, one can approximate its value viz. a straightforward Monte Carlo approximation involving random variate generation of independent gamma random variates $\{X_i\}$.

Let us note that since the $\{T_i\}$ have standard gamma distributions with shape parameters $r_i$, we have $E(T_i) = \mathrm{Var}(T_i) = r_i$, and an application of Lemma 9 leads to the following result.

Proposition 2 Let $Y$ have a mixed Poisson distribution defined viz. (15), where the $\{N_i(\cdot)\}$ are independent Poisson processes with respective rates $\lambda_i$ and $T$ has a multivariate gamma distribution with standard gamma margins with shape parameters $r_i$ and CDFs $F_i$, governed by a copula PDF $c(u)$. Then, $E(Y) = I(\lambda)\, r$, where $r = (r_1, \dots, r_d)$ and $I(\lambda)$ is a $d \times d$ diagonal matrix with the $\{\lambda_i\}$ on the main diagonal. Moreover, the covariance matrix of $Y$ is given by

$$\Sigma_Y = I(\lambda) I(r) + I(\lambda)\, \Sigma_T\, I(\lambda),$$

where $\Sigma_T$ is the covariance matrix of $T$ and $I(r)$ is a $d \times d$ diagonal matrix with the $\{r_i\}$ on the main diagonal.

Remark 4 The correlation of $Y_i$ and $Y_j$ is still given by (25), where this time

$$c_{i,j} = \sqrt{\frac{\lambda_i}{1 + \lambda_i}}\ \sqrt{\frac{\lambda_j}{1 + \lambda_j}}, \quad i \ne j,$$

since in (26) we have $E(T_i) = \mathrm{Var}(T_i)$. Let us note that while in principle the quantities $c_{i,j}$ can assume any value in $(0,1)$ when we choose appropriate $\lambda_i$ and $\lambda_j$, they are fixed for particular marginal NB distributions, since in this model the NB probabilities are given by $p_i = 1/(1 + \lambda_i)$. In terms of the latter, we have

$$c_{i,j} = \sqrt{1 - p_i}\, \sqrt{1 - p_j}, \quad i \ne j.$$

These quantities, along with the full range of correlations for $\rho_{T_i, T_j}$ in (25), can be used to obtain the upper and lower bounds for possible correlations of $Y_i$ and $Y_j$ in this model. We note that the possible range of $\rho_{T_i, T_j}$ depends on the shape parameters $r_i$ and $r_j$. If the $\{T_i\}$ are exponential (so that $r_i = r_j = 1$), then the upper limit of their correlation can be shown to be 1. However, the full range for the correlation of $T_i$ and $T_j$ is usually a subset of $[-1, 1]$, which can be approximated by Monte Carlo simulations (see Remarks 1-2) or other approximate methods (see, e.g., Demitras and Hedeker (2011)).

2.2 Simulation

One particularly convenient way of defining this model for simulation is to use the Gaussian copula to generate $T$. This is a very popular methodology due to its flexibility and the ease of simulating from the required multivariate normal distribution. The Gaussian copula is the one that corresponds to a multivariate normal distribution with standard normal marginal distributions and covariance matrix $R$. Since the marginals are standard normal, $R$ is also the correlation matrix. If $F_R$ is the CDF of such a multivariate normal distribution, then the corresponding Gaussian copula $C_R$ is defined through

$$F_R(x_1, \dots, x_d) = C_R(\Phi(x_1), \dots, \Phi(x_d)),$$

where $\Phi(\cdot)$ is the standard normal CDF. Note that the copula $C_R$ is simply the CDF of the random vector $(\Phi(X_1), \dots, \Phi(X_d))$, where $(X_1, \dots, X_d) \sim N_d(0, R)$.
If the distribution is continuous (so that $R$ is non-singular), the copula $C_R$ admits the PDF $c_R$, given by

$$c_R(u_1, \dots, u_d) = \frac{1}{|R|^{1/2}} \exp\left\{ -\frac{1}{2} \left( \Phi^{-1}(u) \right)^\top \left( R^{-1} - I_d \right) \Phi^{-1}(u) \right\}, \quad u = (u_1, \dots, u_d) \in [0,1]^d, \quad (32)$$

where $\Phi^{-1}(u) = (\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_d))^\top$ and $I_d$ is the $d \times d$ identity matrix. This $c_R$ will then be used in equations (20), (23), and (14). Simulation of multivariate gamma $T$ with margins $F_i$ based on this copula is quite simple, and involves the following steps:

(i) Generate $X = (X_1, \dots, X_d)^\top \sim N_d(0, R)$;
(ii) Transform $X$ to $U = (U_1, \dots, U_d)^\top$ viz $U_i = \Phi(X_i)$, $i = 1, \dots, d$;
(iii) Return $T = (T_1, \dots, T_d)^\top$, where $T_i = F_i^{-1}(U_i)$, $i = 1, \dots, d$.

Remark 5 This strategy of using the Gaussian copula to generate multivariate distributions is quite popular indeed, and it has become known in the literature as the NORTA (NORmal To Anything) method (see, e.g., Chen (2001); Song and Hsiao (1993)). This methodology has recently been used to generate multivariate discrete distributions; see, e.g., Barbiero and Ferrari (2017), Madsen and Birkes (2013), or Nikoloulopoulos (2013) and references therein. The standard approach discussed in these papers proceeds by simulating the vector $U$ from the Gaussian copula following steps (i)-(ii) above and then transforming the coordinates of $U$ directly viz. the inverse CDFs of the components of the target random vector $Y = (Y_1, \dots, Y_d)^\top$, which can be described as

(iii)' Return $Y = (Y_1, \dots, Y_d)^\top$, where $Y_i = G_i^{-1}(U_i)$, $i = 1, \dots, d$.

Here, the $G_i$ are the CDFs of the $Y_i$. If the distributions of the $Y_i$ are discrete (such as NB), the inverse CDF is defined in the standard way as

$$G^{-1}(u) = \inf\{y : G(y) \ge u\}.$$

The difference between our approach and the one discussed in the literature lies in the final step, regardless of the particular copula $c$ that is used. In the standard approach, one first simulates a random $U$ from $c$ and then proceeds viz. (iii)' above to get the target random vector $Y$ (having a multivariate distribution with CDFs $G_i$). On the other hand, our proposal is to first generate $T$ viz. step (iii) above and then obtain the target variable viz. (15). While our methodology involves an extra step compared with this direct method, it offers a simple way of calculating the joint probabilities, which is not available in the other approach. Additionally, our methodology offers a stochastic explanation of the resulting distributions viz. the mixing mechanism and its relation to the underlying Poisson processes, which is lacking in the somewhat artificial standard approach. Another advantage of the mixed Poisson approach is the possibility of extensions to more general stochastic processes in the spirit of the NB process studied by Kozubowski and Podgórski (2009). On the other hand, its disadvantage is the fact that not all discrete marginal distributions can be obtained, only those that are mixed Poisson to begin with.

Remark 6 Let us note that the mixed Poisson approach to generating multivariate distributions was used in Madsen and Dalthorp (2007), where $Y$ was obtained viz. (15) with standard Poisson processes and with $T = e^X$, where $X$ is multivariate normal with mean $\mu = (\mu_1, \dots, \mu_d)$ and covariance matrix $\Sigma = [\sigma_{i,j}]$. Since in this case the marginals of $T$ have log-normal distributions, the authors referred to this construction as a lognormal-Poisson hierarchy. This can be seen as a special case of our scheme, where we have $\lambda_i = e^{\mu_i}$ and the marginal CDFs of the $T_i$ of the form $F_i(t) = \Phi(\log t / \sigma_{ii}^{1/2})$. The copula PDF of the $\{T_i\}$ is the Gaussian copula (32), where $R$ is the correlation matrix corresponding to $\Sigma$.
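The lognormal-Poisson special case of Remark 6 is particularly convenient to simulate, since step (iii) reduces to exponentiation. A minimal sketch follows (the mean vector, log-scale variance, and correlation are arbitrary illustrative values; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 3, 200_000
mu = np.array([0.5, 1.0, 1.5])           # illustrative log-means
sig2, rho = 0.25, 0.6                    # log-scale variance and correlation
Sigma = sig2 * (rho * np.ones((d, d)) + (1 - rho) * np.eye(d))

X = rng.multivariate_normal(mu, Sigma, size=n)
T = np.exp(X)                  # lognormal stochastic rates (Remark 6)
Y = rng.poisson(T)             # Y_i = N_i(T_i) with unit-rate Poisson processes

m_target = np.exp(mu + sig2 / 2)         # E(Y_i) = E(T_i), lognormal mean
print(Y.mean(axis=0), m_target)
print(np.corrcoef(Y, rowvar=False)[0, 1])      # positive induced correlation
```

The sample means match the lognormal means $e^{\mu_i + \sigma_{ii}/2}$, the marginal variances exceed the means (over-dispersion, per Lemma 9), and the counts inherit a positive correlation from $T$.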
An important aspect of this problem is how to set the parameters of the underlying copula function so that the distribution of $Y$ has given characteristics, such as the means and the covariances (and correlations). In the case where a Gaussian copula is used, this amounts to determining the correlation matrix $R$. This problem arises in the general scheme (i)-(iii) as well, and has been discussed in the literature (see, e.g., Barbiero and Ferrari (2017); Xiao (2017); Xiao and Zhou (2019)). Generally, there is no simple relation between $R$ and the correlation matrix of $T$ in (i)-(iii). However, other measures of association, such as Kendall's $\tau$ or Spearman's $\rho$, do transfer directly and may be preferable in our set-up. These issues will be the subject of further research.

3 Examples

We provide two examples. The first example describes the T-Poisson hierarchy approach to constructing a multivariate geometric distribution. Second, we demonstrate how the T-Poisson hierarchy can be used to conduct a high-dimensional ($d = 1026$) simulation study inspired by RNA-sequencing data, a challenging computational task.

3.1 Multivariate geometric distributions

Suppose that the random vector $T$ in (15) has marginal standard exponential distributions, so that the marginal CDFs of the $\{T_i\}$ are of the form

$$F_i(t) = 1 - e^{-t}, \quad t \in \mathbb{R}_+. \quad (33)$$

In this case, the $\{Y_i\}$ have geometric distributions with parameters $p_i = 1/(1 + \lambda_i)$, so that

$$P(Y_i = y) = p_i (1 - p_i)^y, \quad y \in \mathbb{N}_0. \quad (34)$$

One can then obtain a multitude of multivariate distributions with geometric margins by selecting various copulas for the underlying distribution of $T$. As an example, consider the case of the Farlie-Gumbel-Morgenstern (FGM) copula driven by a parameter $\theta \in [-1, 1]$, given by

$$C(u) = \left\{ \prod_{i=1}^d u_i \right\} \left( 1 + \theta \prod_{i=1}^d (1 - u_i) \right), \quad u = (u_1, \dots, u_d) \in [0,1]^d. \quad (35)$$

Consider the two-dimensional case with $d = 2$, where the PDF corresponding to (35) is of the form

$$c(u) = 1 + \theta (1 - 2u_1)(1 - 2u_2), \quad u = (u_1, u_2) \in [0,1]^2. \quad (36)$$

In this case, the random vector $X = (X_1, X_2)$ in Corollary 1 has independent gamma margins (31) with shape parameters $y_i + 1$ and rate parameters $1 + \lambda_i$, $i = 1, 2$. Using this fact, coupled with (33), one can evaluate the expectation in (14), leading to

$$E\left\{ c(F_1(X_1), F_2(X_2)) \right\} = 1 + \theta \left( 1 - 2 \left( \frac{1}{1 + p_1} \right)^{y_1 + 1} \right) \left( 1 - 2 \left( \frac{1}{1 + p_2} \right)^{y_2 + 1} \right). \quad (37)$$

In view of Corollary 1, this leads to the following expression for the joint probabilities of the bivariate geometric distribution defined by our scheme viz. the FGM copula:

$$P(Y = y) = \left\{ \prod_{i=1}^2 p_i (1 - p_i)^{y_i} \right\} \left( 1 + \theta \prod_{i=1}^2 \left( 1 - 2 \left( \frac{1}{1 + p_i} \right)^{y_i + 1} \right) \right), \quad y = (y_1, y_2) \in \mathbb{N}_0^2. \quad (38)$$

We shall denote this distribution by $GEO(p_1, p_2, \theta)$. When $\theta = 0$, the $\{Y_i\}$ are independent geometric variables with parameters $p_i \in (0,1)$, $i = 1, 2$. Otherwise, $Y_1, Y_2$ are correlated, with

$$\mathrm{Cov}(Y_1, Y_2) = \frac{\theta}{4} \cdot \frac{(1 - p_1)(1 - p_2)}{p_1 p_2}, \quad (39)$$

as can be verified by routine, albeit tedious, algebra. In turn, the correlation of $Y_1, Y_2$ becomes

$$\rho_{Y_1, Y_2} = \frac{\theta}{4} \sqrt{1 - p_1}\, \sqrt{1 - p_2}, \quad (40)$$

and can generally take any value in the range $(-1/4, 1/4)$.

3.2 Simulating RNA-seq data

This section describes how to simulate data using a T-Poisson hierarchy, aiming to replicate the structure of high-dimensional dependent count data. In fact, simulating RNA-sequencing (RNA-seq) data is one of the primary motivating applications of the proposed methodology, which seeks scalable Monte Carlo methods for realistic multivariate simulation (see, for example, Schissler et al. (2018)).

The RNA-seq data generating process involves counting how often a particular messenger RNA (mRNA) is expressed in a biological sample. Since this is a counting process with no upper bound, many modeling approaches use discrete random variables with infinite support.
Often the counts exhibit over-dispersion, and so the negative binomial arises as a sensible model for the expression levels (gene counts). Moreover, the counts are correlated (co-expressed) and cannot be assumed to behave independently. RNA-seq platforms quantify the entire transcriptome in one experimental run, resulting in high-dimensional data. In humans, this results in count data corresponding to over 20,000 genes (coding genomic regions) or even over 77,000 isoforms when alternatively spliced mRNA are counted. This suggests that simulating high-dimensional multivariate NB vectors with heterogeneous marginals would be a useful tool in the development and evaluation of RNA-seq analytics.

In an illustration of our proposed methodology applied to real data, we seek to simulate RNA-sequencing data by producing simulated random vectors generated from the Type II T-Poisson framework (as in Eq. (13)). Our goal is to replicate the structure of a breast cancer data set (BRCA: breast invasive carcinoma data set from The Cancer Genome Atlas). For simplicity, we begin by filtering to retain the top 5% highest-expressing genes of the 20,501 gene measurements from N = 1212 patients' tumor samples, resulting in d = 1026 genes. All these genes exhibit over-dispersion, and so we proceed to estimate the NB parameters (r_i, p_i), i = 1, ... , d, to determine the target marginal PMFs g_i(y_i) (via the method of moments). Notably, the p_i's are small, ranging in [3.934 × 10^{-6}, 1.217 × 10^{-2}]. To complete the simulation algorithm inputs, we estimate the Pearson correlation matrix R_Y and set that as the target correlation.

With the simulation targets specified, we proceed to simulate B = 10,000 random vectors Y = (Y_1, ... , Y_d) with target Pearson correlation R_Y and marginal PMFs g_i(y_i) using a T-Poisson hierarchy of Kind II.
Specifically, we first employ the direct Gaussian copula approach to generate B random vectors following a standard multivariate gamma distribution T with shape parameters r_i equal to the target NB sizes and Pearson correlation matrix R_T. Care must be taken when specifying R_T (refer to Eq. (32)): we employ Eq. (25) to compute the scaling factors c_{i,j} and adjust the underlying correlations to ultimately match the target R_Y. Notably, of the 525,825 pairwise correlations from the 1026 genes, no scale factor was less than 0.9907, indicating the model can produce essentially the entire range of possible correlations. Here we are satisfied with approximate matching of the specified gamma correlation and set R̃ = R_T in our Gaussian copula scheme (R̃ indicating the specified multivariate Gaussian correlation matrix). Finally, we generate the desired random vector Y by setting Y_i = N_i(T_i), simulating Poisson counts with expected value μ_i = λ_i × T_i, for i = 1, ... , d (with λ_i = (1 - p_i)/p_i), and repeat B = 10,000 times.

Fig. 1 The T-Poisson strategy produces simulated random vectors from a multivariate negative binomial (NB) that replicate the estimated structure from an RNA-seq data set. The dashed red lines indicate equality between estimated parameters (vertical axes; derived from the simulated data) and the specified target parameters (horizontal axes).

Figure 1 shows the results of our simulation by comparing the specified target parameters (horizontal axes) with the corresponding quantities estimated from the simulated data (vertical axes). The evaluation shows that the simulated counts approximately match the target parameters and exhibit the full range of estimated correlations from the data. Utilizing 15 CPU threads on a MacBook Pro with a 2.4 GHz 8-core Intel Core i9 processor, the simulation completed in less than 30 seconds.
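A small-scale sketch of this pipeline follows, with illustrative NB targets and the Eq. (25) scale-factor adjustment omitted for brevity (so the gamma correlations only approximately match the Gaussian ones):

```python
import numpy as np
from scipy.stats import norm, gamma

rng = np.random.default_rng(7)

# Illustrative targets: NB sizes r_i and probabilities p_i, plus a Gaussian
# copula correlation matrix R (the Eq. (25) adjustment is omitted here).
r = np.array([5.0, 2.0, 8.0])
p = np.array([0.20, 0.10, 0.05])
lam = (1 - p) / p                      # NB(r_i, p_i) arises from Poisson rate lam_i
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

B = 20_000
Z = rng.multivariate_normal(np.zeros(3), R, size=B)   # latent Gaussian draws
U = norm.cdf(Z)                                       # uniform margins, coupled
T = gamma.ppf(U, a=r)                                 # standard gamma margins
Y = rng.poisson(lam * T)                              # T-Poisson hierarchy, Kind II

# Marginal means should approach the NB means r_i * (1 - p_i) / p_i.
print(Y.mean(axis=0), r * lam)
```

The full-scale simulation in the paper additionally rescales the copula correlations so that the resulting Pearson correlations of Y match the estimated target matrix.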
Appendix

Gamma-Poisson mixtures

For the convenience of the reader, we include a short proof of the well-known fact that a Poisson distribution with a gamma-distributed parameter is NB (see, e.g., Solomon (1983)).

Lemma 1 If {N(t), t ∈ R_+} is a homogeneous Poisson process with rate λ = (1 - p)/p > 0 and T is an independent standard gamma variable with shape parameter r, then the randomly stopped process, Y = N(T), has a NB distribution NB(r, p) with the PMF (7).

Proof Suppose that T has a standard gamma distribution with the PDF (6) and the corresponding CDF F_T. When we substitute the latter into (5), we obtain

    P(Y = n) = ∫_{R_+} [e^{-λt} (λt)^n / n!] [1/Γ(r)] t^{r-1} e^{-t} dt.

After some algebra, this produces

    P(Y = n) = [Γ(n + r) λ^n / (Γ(r) n! (1 + λ)^{n+r})] ∫_{R_+} [(1 + λ)^{n+r} / Γ(n + r)] t^{n+r-1} e^{-t(1+λ)} dt.

Since the integrand above is the PDF of a gamma distribution with shape n + r and rate 1 + λ, the integral becomes 1 and we have

    P(Y = n) = [Γ(n + r) / (Γ(r) n!)] (1/(1 + λ))^r (λ/(1 + λ))^n,

which we recognize as the NB probability from (7) with p = (1 + λ)^{-1}. The result follows when we set λ = (1 - p)/p in the above analysis.

Mixed multivariate Poisson distributions of type I

Here we provide basic distributional facts about mixed multivariate Poisson distributions of Type I, which are the distributions of

    Y = (Y_1, ... , Y_d) = (N_1(T), ... , N_d(T)),

where the {N_i(·)} are independent Poisson processes with rates λ_i and T is a random variable on R_+, independent of the {N_i}.

Lemma 2 In the above setting, the PGF of Y is given by

    G(s) = E[∏_{i=1}^d s_i^{Y_i}] = φ_T(∑_{i=1}^d λ_i - ∑_{i=1}^d λ_i s_i),  s = (s_1, ... , s_d) ∈ [0,1]^d,

where φ_T is the LT of T.

Proof By using a standard conditioning argument, we have

    G(s) = E[∏_{i=1}^d s_i^{Y_i}] = ∫_{R_+} E[∏_{i=1}^d s_i^{Y_i} | T = t] dF_T(t)
(41)

Since given T = t the variables {Y_i} are independent and Poisson distributed with means {λ_i t}, respectively, we have

    E[∏_{i=1}^d s_i^{Y_i} | T = t] = ∏_{i=1}^d E[s_i^{Y_i} | T = t] = ∏_{i=1}^d e^{-λ_i t (1 - s_i)} = e^{-t(∑_{i=1}^d λ_i - ∑_{i=1}^d λ_i s_i)}.

When we substitute the above into (41) we conclude that the PGF of Y is indeed of the form stated above.

Remark 7 Note that in the one-dimensional case d = 1, we recover the well-known formula for the PGF of Y = N(T),

    G(s) = φ_T(λ(1 - s)),  s ∈ [0,1],  (42)

where λ > 0 is the rate of the Poisson process {N(t), t ∈ R_+}. If we further assume that T is standard gamma distributed with shape parameter r > 0, so that

    φ_T(t) = (1/(1 + t))^r,  t ∈ R_+,

and we take λ = (1 - p)/p, we obtain

    G(s) = (p/(1 - (1 - p)s))^r,  s ∈ [0,1].  (43)

We recognize this as the PGF of the NB distribution NB(r, p), as it should be according to Lemma 1. Similarly, the PGF of a d-dimensional mixed Poisson distribution with such a gamma-distributed T takes on the form

    G(s) = (1/(Q - ∑_{i=1}^d P_i s_i))^r,  s = (s_1, ... , s_d) ∈ [0,1]^d,

where P_i = λ_i and Q = 1 + ∑_{i=1}^d P_i. This is a general form of the multivariate negative multinomial distribution (see Chapter 36 of Johnson et al. (1997)). Since the PGF of the marginal distributions of Y_i in this setting is of the form (43) with p_i = (1 + λ_i)^{-1}, all marginal distributions are NB. Due to this property, discrete multivariate distributions with the above PGFs have been termed multivariate NB distributions (for more details, see Johnson et al. (1997)).

Remark 8 Let us note that changing the scaling factor of the variable T in this model has the same effect as adjusting the rate parameters of the Poisson processes {N_i(·)}. Namely, it follows from Lemma 2 that if we let T̃ = cT in the above setting, then we have the following equality in distribution:

    (N_1(T̃), ... , N_d(T̃)) =_d (Ñ_1(T), ... , Ñ_d(T)),  (44)

where the {Ñ_i(·)} are independent Poisson processes with rates cλ_i, respectively.
Thus, without loss of generality, we may assume that the scale parameter of the variable T in this model is set to unity.

Lemma 3 In the above setting, the PMF of Y is given by

    P(Y = y) = (∏_{i=1}^d g_i(y_i)) h(y),  y = (y_1, ... , y_d) ∈ N_0^d,

where

    g_i(y) = (λ_i^y / y!) v_T(y, λ_i),  y ∈ N_0,

are the marginal PMFs of the {Y_i},

    v_T(y, λ) = E[T^y e^{-λT}],  λ, y ∈ R_+,

and the function h is given by

    h(y) = v_T(∑_{i=1}^d y_i, ∑_{i=1}^d λ_i) / ∏_{i=1}^d v_T(y_i, λ_i),  y = (y_1, ... , y_d) ∈ N_0^d.

Proof Since given T = t the variables {Y_i} are independent and Poisson distributed with means {λ_i t}, respectively, by using a standard conditioning argument, followed by some algebra, we have

    P(Y = y) = ∫_{R_+} (∏_{i=1}^d (λ_i^{y_i} / y_i!)) e^{-t ∑_{i=1}^d λ_i} t^{∑_{i=1}^d y_i} dF_T(t) = (∏_{i=1}^d (λ_i^{y_i} / y_i!)) v_T(∑_{i=1}^d y_i, ∑_{i=1}^d λ_i).  (45)

Similarly, the marginal PMFs are given by

    P(Y_i = y) = (λ_i^y / y!) ∫_{R_+} e^{-tλ_i} t^y dF_T(t) = (λ_i^y / y!) v_T(y, λ_i).  (46)

By combining (45) and (46), we obtain the result.

Remark 9 Note that the joint PMF of Y can also be written as

    P(Y = y) = (∏_{i=1}^d (λ_i^{y_i} / y_i!)) v_T(∑_{i=1}^d y_i, ∑_{i=1}^d λ_i),  (47)

which is a convenient expression for approximating these probabilities by Monte Carlo simulation if the function v_T(·, ·) is not available explicitly and the random variable T is straightforward to simulate. We also note that whenever the marginal PMFs of Y_i are explicit, then so is the function v_T(·, ·), which is clear from Lemma 3. For example, if T is standard gamma with shape parameter r, then we have

    v_T(y, λ) = [Γ(r + y) / Γ(r)] (1/(1 + λ))^{r+y} = (y!/λ^y) P(Y = y),  λ, y ∈ R_+,

where Y has a NB distribution with parameters r and p = 1/(1 + λ).

Next, we present an alternative expression for the joint probabilities P(Y = y), which provides a convenient formula for their computation whenever the variable T is difficult to simulate but its PDF is easy to compute.
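The Monte Carlo route mentioned in Remark 9 can be illustrated when T is standard gamma, where the closed form of v_T is available for comparison; the parameter values below are arbitrary:

```python
import numpy as np
from math import lgamma, exp, log

rng = np.random.default_rng(11)

r = 3.0
T = rng.gamma(shape=r, size=500_000)     # draws of the mixing variable T

def v_closed(y, lam, shape):
    """Closed-form v_T(y, lam) for standard gamma T with shape r (Remark 9)."""
    return exp(lgamma(shape + y) - lgamma(shape) - (shape + y) * log(1 + lam))

def v_mc(y, lam):
    """Monte Carlo estimate of v_T(y, lam) = E[T^y exp(-lam T)]."""
    return np.mean(T ** y * np.exp(-lam * T))

# Joint PMF (47) for d = 2 with rates (1, 2) at y = (1, 2):
lam = np.array([1.0, 2.0])
y = np.array([1, 2])
coef = np.prod(lam ** y / np.array([1.0, 2.0]))   # prod of lam_i^{y_i} / y_i!
pmf_mc = coef * v_mc(y.sum(), lam.sum())
pmf_exact = coef * v_closed(y.sum(), lam.sum(), r)
```

The Monte Carlo estimate agrees with the closed form to within sampling error, which is tiny here because T^y e^{-λT} is bounded.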
The alternative representation involves a multinomial random vector N = (N_1, ... , N_d) with parameters n and p = (p_1, ... , p_d), denoted by MUL(n, p), where n ∈ N represents the number of trials, the {p_i} represent event probabilities that sum to one, and

    P(N = y) = [n! / (y_1! ··· y_d!)] p_1^{y_1} ··· p_d^{y_d},  y ∈ {k = (k_1, ... , k_d) ∈ N_0^d : ∑_{i=1}^d k_i = n}.  (48)

Lemma 4 In the above setting, the PMF of Y is given by

    P(Y = y) = P(N = y) E[f_{λT}(W)],  y = (y_1, ... , y_d) ∈ N_0^d,  (49)

where λ = ∑_{i=1}^d λ_i, N ∼ MUL(n, p) with n = ∑_{i=1}^d y_i and p_i = λ_i/λ, the quantity f_{λT} is the PDF of λT, and W has a standard gamma distribution with shape parameter n + 1.

Proof Proceeding as in the proof of Lemma 3, we obtain

    P(Y = y) = (∏_{i=1}^d (λ_i^{y_i} / y_i!)) (n!/λ^n) (1/λ) ∫_{R_+} [λ^{n+1}/n!] t^{(n+1)-1} e^{-λt} f_T(t) dt.  (50)

Since the integrand is the product of f_T(t) and the density of a gamma random variable X with shape parameter n + 1 and scale 1/λ, we have

    (1/λ) ∫_{R_+} [λ^{n+1}/n!] t^{(n+1)-1} e^{-λt} f_T(t) dt = (1/λ) E[f_T(X)] = (1/λ) E[f_T(W/λ)],

where W = λX has a standard gamma distribution with shape parameter n + 1 (and scale 1). To conclude the result, observe that the expression

    (∏_{i=1}^d λ_i^{y_i}) n! / (λ^n ∏_{i=1}^d y_i!)

in (50) coincides with the multinomial probability (48) with p_i = λ_i/λ, while

    (1/λ) f_T(w/λ) = f_{λT}(w).

Remark 10 Note that in the one-dimensional case, where d = 1, the multinomial probability in (49) reduces to 1, and we obtain

    P(Y = y) = E[f_{λT}(W)],  y ∈ N_0,  (51)

where Y = N(T), {N(t), t ∈ R_+} is a Poisson process with rate λ, the quantity f_{λT} is the PDF of λT, the variable W has a standard gamma distribution with shape parameter y + 1, and T is independent of the Poisson process.

Finally, we present well-known results concerning the mean and the covariance structure of mixed multivariate Poisson distributions of Type I, which are easily derived through standard conditioning arguments.
Generally, whenever the mean of T exists then so does the mean of each Y_i, and we have E(Y_i) = λ_i E(T). Moreover, the variance of each Y_i is finite whenever T has a finite second moment, in which case we have Var(Y_i) = λ_i E(T) + λ_i^2 Var(T). Thus, the variance of Y_i is greater than the mean, and the distribution of Y_i is over-dispersed. Finally, under the latter assumption, the covariance of Y_i and Y_j exists and equals Cov(Y_i, Y_j) = λ_i λ_j Var(T). The result below summarizes these facts.

Lemma 5 In the above setting, the mean vector of Y exists whenever the mean of T is finite, in which case we have E(Y) = λE(T), where λ = (λ_1, ... , λ_d). Moreover, if T has a finite second moment, then the covariance matrix of Y is well defined and is given by

    Σ_Y = E(T) I(λ) + Var(T) λλ^⊤,

where I(λ) is a d × d diagonal matrix with the {λ_i} on the main diagonal.

Remark 11 By the above result, if it exists, the correlation coefficient of Y_i and Y_j is given by

    ρ_{i,j} = √(λ_i λ_j) / √((λ_i + E(T)/Var(T))(λ_j + E(T)/Var(T))).

The correlation is always positive, and can generally fall anywhere within the boundaries of zero and one.

Mixed multivariate Poisson distributions of type II

Here we provide basic distributional facts about mixed multivariate Poisson distributions of Type II, which are the distributions of

    Y = (Y_1, ... , Y_d) = (N_1(T_1), ... , N_d(T_d)),

where the {N_i(·)} are independent Poisson processes with rates λ_i and T = (T_1, ... , T_d) is a random vector in R_+^d with the PDF f_T, independent of the {N_i}.

Lemma 6 In the above setting, the PGF of Y is given by

    G(s) = E[∏_{i=1}^d s_i^{Y_i}] = φ_T(I(λ)(1 - s)),  s = (s_1, ... , s_d) ∈ [0,1]^d,  (52)

where φ_T is the LT of T, I(λ) is a d × d diagonal matrix with the {λ_i} on the main diagonal, and 1 is a d-dimensional column vector of 1s.
Proof By using a standard conditioning argument, we have

    G(s) = E[∏_{i=1}^d s_i^{Y_i}] = ∫_{R_+^d} E[∏_{i=1}^d s_i^{Y_i} | T = t] dF_T(t).  (53)

Since given T = t the variables {Y_i} are independent and Poisson distributed with means {λ_i t_i}, respectively, we have

    E[∏_{i=1}^d s_i^{Y_i} | T = t] = ∏_{i=1}^d E[s_i^{Y_i} | T = t] = ∏_{j=1}^d e^{-λ_j t_j (1 - s_j)} = e^{-t^⊤ I(λ)(1-s)}.

When we substitute the above into (53) we conclude that the PGF of Y is as stated in the lemma.

Remark 12 Note that the expression (52) is a generalization of (42) to the multivariate case of mixed Poisson. Additionally, observe that if the components of T coincide, that is, T_i = T for i = 1, ... , d, we have

    φ_T(t) = E[e^{-t^⊤ T}] = E[e^{-(t_1 + ··· + t_d)T}] = φ_T(t_1 + ··· + t_d),

and the PGF in (52) reduces to its counterpart provided in Lemma 2, as it should.

Remark 13 Let us note that changing the scaling factors of the variables T_i in this model has the same effect as adjusting the rate parameters of the Poisson processes {N_i(·)}. Namely, it follows from Lemma 6 that if we let T̃_i = c_i T_i in the above setting, then we have the following equality in distribution:

    (N_1(T̃_1), ... , N_d(T̃_d)) =_d (Ñ_1(T_1), ... , Ñ_d(T_d)),  (54)

where the {Ñ_i(·)} are independent Poisson processes with rates c_i λ_i, respectively. Thus, without loss of generality, we may assume that the scale parameters of the variables T_i in this model are set to unity.

Next, we provide a convenient formula for the PMF of multivariate mixed Poisson distributions of Type II, which is an extension of that given in Lemma 3. To state the result, we extend the definition of the function v described by (12) to vector-valued arguments and random vectors T in R_+^d. Namely, for a = (a_1, ... , a_d), b = (b_1, ... , b_d) ∈ R_+^d we set

    a^b = ∏_{i=1}^d a_i^{b_i}  (55)

and define

    v_T(y, λ) = E[T^y e^{-λ^⊤ T}],  λ, y ∈ R_+^d.
(56)

Lemma 7 In the above setting, the PMF of Y is given by

    P(Y = y) = (∏_{i=1}^d g_i(y_i)) h(y),  y = (y_1, ... , y_d) ∈ N_0^d,

where

    g_i(y) = (λ_i^y / y!) v_{T_i}(y, λ_i),  y ∈ N_0,

are the marginal PMFs of the {Y_i} and the function h is given by

    h(y) = v_T(y, λ) / ∏_{i=1}^d v_{T_i}(y_i, λ_i),  y = (y_1, ... , y_d) ∈ N_0^d.

Proof By using a standard conditioning argument, we have

    P(Y = y) = ∫_{R_+^d} P(N_1(T_1) = y_1, ... , N_d(T_d) = y_d | T = t) f_T(t) dt,  (57)

where y = (y_1, ... , y_d) and t = (t_1, ... , t_d). Further, by independence, we have

    P(N_1(T_1) = y_1, ... , N_d(T_d) = y_d | T = t) = ∏_{i=1}^d P(N_i(t_i) = y_i).  (58)

Since the N_i(t_i) are Poisson with parameters λ_i t_i, we have

    P(N_i(t_i) = y_i) = e^{-λ_i t_i} (λ_i t_i)^{y_i} / y_i!,  i = 1, ... , d.  (59)

When we now substitute (58)-(59) into (57), then after some algebra we get

    P(Y = y) = (∏_{i=1}^d (λ_i^{y_i} / y_i!)) ∫_{R_+^d} e^{-∑_{i=1}^d λ_i t_i} (∏_{i=1}^d t_i^{y_i}) f_T(t) dt = (∏_{i=1}^d (λ_i^{y_i} / y_i!)) v_T(y, λ).  (60)

Similarly, the marginal PMFs are given by

    P(Y_i = y) = (λ_i^y / y!) ∫_{R_+} e^{-tλ_i} t^y dF_{T_i}(t) = (λ_i^y / y!) v_{T_i}(y, λ_i).  (61)

By combining (60) and (61), we obtain the result.

We now present an alternative expression for the joint probabilities P(Y = y), which facilitates their computation if the random vector T is difficult to simulate but its PDF is readily available.

Lemma 8 In the above setting, the PMF of Y is given by

    P(Y = y) = E[f_{I(λ)T}(W)],  y = (y_1, ... , y_d) ∈ N_0^d,  (62)

where the quantity f_{I(λ)T} is the PDF of I(λ)T = (λ_1 T_1, ... , λ_d T_d) and W = (W_1, ... , W_d) with mutually independent W_i having standard gamma distributions with shape parameters y_i + 1.

Proof Proceeding as in the proof of Lemma 4, we obtain

    P(Y = y) = (∏_{i=1}^d (1/λ_i)) ∫_{R_+^d} (∏_{i=1}^d [λ_i^{y_i+1} / y_i!] t_i^{(y_i+1)-1} e^{-λ_i t_i}) f_T(t) dt.  (63)

Note that the product under the integral above is the PDF of X = (X_1, ...
, X_d), where the X_i are mutually independent gamma random variables with shape parameters y_i + 1 and rate parameters λ_i. This allows us to conclude that

    P(Y = y) = (∏_{i=1}^d (1/λ_i)) E[f_T(X)] = (∏_{i=1}^d (1/λ_i)) E[f_T(W_1/λ_1, ... , W_d/λ_d)],

where W = (W_1, ... , W_d) = I(λ)X has independent standard gamma components with shape parameters y_i + 1. To conclude the result, observe that

    (∏_{i=1}^d (1/λ_i)) f_T(w_1/λ_1, ... , w_d/λ_d) = f_{I(λ)T}(w).

Finally, let us summarize standard results concerning the mean and the covariance structure of mixed multivariate Poisson distributions of Type II, which parallel the results for Type I and are easily derived through standard conditioning arguments. Generally, whenever the means of the {T_i} exist then so do the means of the {Y_i}, and we have E(Y_i) = λ_i E(T_i). Similarly, the variance of each Y_i is finite whenever T_i has a finite second moment, in which case we have Var(Y_i) = λ_i E(T_i) + λ_i^2 Var(T_i). Again, the distribution of Y_i is always over-dispersed. Finally, for any i ≠ j, the covariance of Y_i and Y_j exists and equals Cov(Y_i, Y_j) = λ_i λ_j Cov(T_i, T_j) whenever the covariance of T_i and T_j exists. These facts are summarized in the result below.

Lemma 9 In the above setting, the mean vector of Y exists whenever the mean of T is finite, in which case we have E(Y) = I(λ)E(T), where λ = (λ_1, ... , λ_d) and I(λ) is a d × d diagonal matrix with the {λ_i} on the main diagonal. Moreover, if T has a finite covariance matrix Σ_T then the covariance matrix of Y is well defined as well and is given by

    Σ_Y = I(λ) I(E(T)) + I(λ) Σ_T I(λ),

where I(E(T)) is a d × d diagonal matrix with the diagonal entries {E(T_i)}.
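The moment formulas of Lemma 9 can be checked by simulation. The dependent mixing vector below, built from a common gamma shock, is an illustrative choice and not the copula construction used elsewhere in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
B = 400_000
lam = np.array([2.0, 5.0])

# A simple dependent mixing vector: a shared gamma shock G0 plus individual
# shocks, so Cov(T1, T2) = Var(G0). This is just one convenient example of T.
G0 = rng.gamma(shape=1.5, size=B)
T = np.column_stack([G0 + rng.gamma(2.0, size=B),
                     G0 + rng.gamma(3.0, size=B)])
Y = rng.poisson(lam * T)   # Type II: Y_i = N_i(T_i), rates lam_i

ET, VT = T.mean(axis=0), T.var(axis=0)
# Lemma 9: Var(Y_i) = lam_i E(T_i) + lam_i^2 Var(T_i);
#          Cov(Y1, Y2) = lam_1 lam_2 Cov(T1, T2).
var_theory = lam * ET + lam ** 2 * VT
cov_theory = lam[0] * lam[1] * np.cov(T.T)[0, 1]
```

With B this large, the empirical variances and covariance of Y match the Lemma 9 values to within a few percent.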
Abbreviations
BRCA: Breast invasive carcinoma; CDF: Cumulative distribution function; FGM: Farlie-Gumbel-Morgenstern; L-P model: Lognormal-Poisson model; mRNA: Messenger ribonucleic acid; NB: Negative binomial; NORTA: NORmal To Anything; PDF: Probability density function; PGF: Probability generating function; PMF: Probability mass function; RNA-seq: RNA-sequencing

Acknowledgements
The authors thank the two reviewers for their comments that helped improve the paper. We also thank Professors Walter W. Piegorsch and Edward J. Bedrick (University of Arizona) for their helpful discussions.

Authors' contributions
AGS and TJK conceived the study. TJK, AKP, and AGS developed the approach. ADK and AGS conducted the computational analyses. TJK, AKP, and AGS wrote the manuscript. TJK, AKP, AGS, and ADK revised the manuscript. All authors read and approved the final document.

Funding
Research reported in this publication was supported by MW-CTR-IN of the National Institutes of Health under award number 1U54GM104944.

Availability of data and materials
The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. Code reproducing the BRCA data set and computational analyses is available from the corresponding author on reasonable request.

Declarations

Competing interests
The authors declare that they have no competing interests.

Received: 28 October 2020 Accepted: 2 March 2021

References
Barbiero, A., Ferrari, P. A.: An R package for the simulation of correlated discrete variables. Comm. Statist. Simul. Comput. 46(7), 5123-5140 (2017)
Chen, H.: Initialization for NORTA: Generation of random vectors with specified marginals and correlations. INFORMS J. Comput. 13(4), 257-360 (2001)
Clemen, R. T., Reilly, T.: Correlations and copulas for decision and risk analysis. Manag. Sci. 45, 208-224 (1999)
Demirtas, H., Hedeker, D.: A practical way for computing approximate lower and upper correlation bounds. Amer. Statist.
65(2), 104-109 (2011)
Johnson, N., Kotz, S., Balakrishnan, N.: Discrete Multivariate Distributions. Wiley, New York (1997)
Karlis, D., Xekalaki, E.: Mixed Poisson distributions. Intern. Statist. Rev. 73(1), 35-58 (2005)
Kozubowski, T. J., Podgórski, K.: Distribution properties of the negative binomial Lévy process. Probab. Math. Statist. 29, 43-71 (2009)
Madsen, L., Birkes, D.: Simulating dependent discrete data. J. Stat. Comput. Simul. 83(4), 677-691 (2013)
Madsen, L., Dalthorp, D.: Simulating correlated count data. Environ. Ecol. Stat. 14(2), 129-148 (2007)
Nelsen, R. B.: An Introduction to Copulas. Springer, New York (2006)
Nikoloulopoulos, A. K.: Copula-based models for multivariate discrete response data. In: Copulae in Mathematical and Quantitative Finance, pp. 231-249. Lect. Notes Stat. 213. Springer, Heidelberg (2013)
Nikoloulopoulos, A. K., Karlis, D.: Modeling multivariate count data using copulas. Comm. Statist. Sim. Comput. 39(1), 172-187 (2009)
Schissler, A. G., Piegorsch, W. W., Lussier, Y. A.: Testing for differentially expressed genetic pathways with single-subject N-of-1 data in the presence of inter-gene correlation. Stat. Methods Med. Res. 27(12), 3797-3813 (2018)
Solomon, D. L.: The spatial distribution of cabbage butterfly eggs. In: Roberts, H., Thompson, M. (eds.) Life Science Models, Vol. 4, pp. 350-366. Springer-Verlag, New York (1983)
Song, W. T., Hsiao, L.-C.: Generation of autocorrelated random variables with a specified marginal distribution. In: Proceedings of the 1993 Winter Simulation Conference (WSC '93), pp. 374-377, Los Angeles (1993). https://doi.org/10.1109/WSC.1993.718074
Xiao, Q.: Generating correlated random vector involving discrete variables. Comm. Statist. Theory Methods 46(4), 1594-1605 (2017)
Xiao, Q., Zhou, S.: Matching a correlation coefficient by a Gaussian copula. Comm. Statist. Theory Methods
48(7), 1728-1747 (2019)

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.




Publisher: Springer Journals
Copyright: © The Author(s) 2021
eISSN: 2195-5832
DOI: 10.1186/s40488-021-00119-y


One important example of this construction is a discrete multivariate model with correlated negative binomial (NB) components and arbitrary parameters.
However, our approach is quite general and can produce families with different margins, going beyond the NB case.

One way to generate multivariate distributions with particular margins is an approach through copulas (see, e.g., Nelsen (2006)), and multivariate discrete distributions constructed through this method have been proposed in recent years (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Nikoloulopoulos (2013); Nikoloulopoulos and Karlis (2009); Xiao (2017) and references therein). Recall that a copula is a cumulative distribution function (CDF) on [0,1]^d, describing a random vector with standard uniform margins. Moreover, for any random vector X = (X_1, ... , X_d) with the joint CDF F and marginal CDFs F_i there is a copula function C(u_1, ... , u_d) so that

    F(x_1, ... , x_d) = P(X_1 ≤ x_1, ... , X_d ≤ x_d) = C(F_1(x_1), ... , F_d(x_d)),  x_i ∈ R, i = 1, ... , d.
(1)

Further, for continuous distributions with marginal probability density functions (PDFs) f_i(x) = F_i'(x), the copula function C is unique, and the joint PDF of the {X_i} is given by

    f(x_1, ... , x_d) = (∏_{i=1}^d f_i(x_i)) c(F_1(x_1), ... , F_d(x_d)),  x_i ∈ R, i = 1, ... , d,  (2)

where the function c(u_1, ... , u_d) is the PDF corresponding to the copula C(u_1, ... , u_d). However, for discrete distributions, the copula is no longer unique and there is no analogue of (2) for calculating the relevant probabilities. Using this concept, one can define a random vector Y = (Y_1, ... , Y_d) in R^d with arbitrary marginal CDFs F_i viz.

    (Y_1, ... , Y_d) = (F_1^{-1}(U_1), ... , F_d^{-1}(U_d)),  (3)

where U = (U_1, ... , U_d) is a random vector with standard uniform margins and the CDF given by

    F_U(u_1, ... , u_d) = P(U_1 ≤ u_1, ... , U_d ≤ u_d) = C(u_1, ... , u_d),  (u_1, ... , u_d) ∈ [0,1]^d,  (4)

with a particular copula C. While one can use any of the multitude of different copula functions in this construction, an approach based on the Gaussian copula, known as NORTA (NORmal To Anything; see, e.g., Chen (2001); Song and Hsiao (1993)), is especially popular due to its flexibility, particularly in the case of discrete distributions (see, e.g., Barbiero and Ferrari (2017); Madsen and Birkes (2013); Nikoloulopoulos (2013)).

While our approach involves copulas as well, the latter connect with continuous multivariate distributions rather than discrete, which avoids the issues with non-uniqueness of the copula function. Additionally, compared with the direct approach (3), in our scheme the computation of relevant probabilities is straightforward. Our methodology is based on mixtures of Poisson distributions, which is a common way of obtaining discrete analogs of continuous distributions on nonnegative reals with a particular stochastic interpretation.
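For reference, the direct construction (3) with a Gaussian copula (the NORTA approach) can be sketched as follows; the NB margins and the latent correlation are illustrative choices:

```python
import numpy as np
from scipy.stats import norm, nbinom

rng = np.random.default_rng(1)

# NORTA sketch: Gaussian copula draws pushed through discrete inverse CDFs, Eq. (3).
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])          # correlation of the latent Gaussian vector
Z = rng.multivariate_normal(np.zeros(2), R, size=50_000)
U = norm.cdf(Z)                     # standard uniform margins coupled by C

# Arbitrary discrete margins via quantile functions, here NB(r, p):
Y1 = nbinom.ppf(U[:, 0], 4, 0.3)
Y2 = nbinom.ppf(U[:, 1], 2, 0.5)
```

Note that the achieved Pearson correlation of (Y1, Y2) is attenuated relative to the latent correlation 0.6, which is exactly the matching problem discussed above.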
Indeed, discrete univariate mixed Poisson distributions have proven to be useful stochastic models in many scientific fields (see, e.g., Karlis and Xekalaki (2005), where one can find a comprehensive review of these distributions with over 30 particular examples). This construction can be described through a randomly stopped Poisson process. More precisely, let {N(t), t ∈ R_+} be a homogeneous Poisson process with rate λ > 0, so that the marginal distribution of N(t) is Poisson with parameter (mean) λt. Then, for any random variable T with cumulative distribution function (CDF) F_T, supported on R_+, the quantity Y = N(T) is an integer-valued random variable, with distribution determined viz. a standard conditioning argument as follows:

    P(Y = n) = ∫_{R_+} [e^{-λt} (λt)^n / n!] dF_T(t),  n ∈ N_0 = {0, 1, ...}.  (5)

Many standard probability distributions on N_0 arise from this scheme. In particular, if T has a standard gamma distribution with shape parameter r > 0, given by the PDF

    f(x) = [1/Γ(r)] x^{r-1} e^{-x},  x ∈ R_+,  (6)

then Y = N(T) will have a NB distribution NB(r, p) with the probability mass function (PMF)

    P(Y = n) = [Γ(n + r) / (Γ(r) n!)] p^r (1 - p)^n,  n ∈ N_0,  (7)

where p = 1/(1 + λ) (see Section 3.2 in the Appendix). As the NB model is quite important across many applications and can be extended to more general stochastic processes (see, e.g., Kozubowski and Podgórski (2009)), it shall serve as a basic example of our approach.

An extension of this scheme to the multivariate case can be accomplished in two different ways, leading to mixed multivariate Poisson distributions of Kind (or Type) I and II in the terminology of Karlis and Xekalaki (2005). The former arises viz.

    Y = (Y_1, ... , Y_d) = (N_1(T), ... , N_d(T)),  (8)

where the {N_i(·)} are Poisson processes with rates λ_i and T is, as before, a random variable on R_+, independent of the {N_i}.
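The gamma-Poisson route to the NB distribution in (5)-(7) is easy to confirm by simulation (the values of r and p below are arbitrary):

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(5)

r, p = 2.0, 0.4
lam = (1 - p) / p                      # Poisson rate implied by p = 1/(1 + lam)
T = rng.gamma(shape=r, size=300_000)   # standard gamma mixing variable, PDF (6)
Y = rng.poisson(lam * T)               # Y = N(T), the randomly stopped process

# Compare empirical frequencies with the NB(r, p) PMF from Eq. (7).
for n in range(4):
    print(n, np.mean(Y == n), nbinom.pmf(n, r, p))
```

The empirical frequencies agree with the NB(r, p) probabilities to within Monte Carlo error, consistent with Lemma 1 in the Appendix.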
While in general the marginal distributions of (N_1(t), ... , N_d(t)) can be correlated multivariate Poisson (see Johnson et al. (1997)), we shall assume that the processes {N_i} are mutually independent. In this case, the joint probability generating function (PGF) of the {Y_i} in (8) is of the form

    G(s_1, ... , s_d) = E[∏_{i=1}^d s_i^{Y_i}] = φ_T(∑_{i=1}^d λ_i - ∑_{i=1}^d λ_i s_i),  (s_1, ... , s_d) ∈ [0,1]^d,  (9)

where φ_T is the Laplace transform (LT) of T, while the relevant probabilities can be conveniently expressed as

    P(Y = y) = (∏_{i=1}^d g_i(y_i)) h(y),  y = (y_1, ... , y_d) ∈ N_0^d,  (10)

where the {g_i} in (10) are the marginal PMFs of the {Y_i}. As shown in the Appendix, the function h is of the form

    h(y) = v_T(∑_{i=1}^d y_i, ∑_{i=1}^d λ_i) / ∏_{i=1}^d v_T(y_i, λ_i),  y = (y_1, ... , y_d) ∈ N_0^d,  (11)

where

    v_T(y, λ) = E[e^{-λT} T^y],  λ, y ∈ R_+.  (12)

In the case of gamma-distributed T, with shape parameter r > 0 and unit scale, the functions v and h above can be evaluated explicitly (see the Appendix for details), and the above distribution is known in the literature as the multivariate negative multinomial distribution (see Chapter 36 of Johnson et al. (1997) and references therein). Since the marginal distributions in this case are NB, the distribution has also been termed multivariate NB. In cases where the function v(·, ·) is not available explicitly, it can be easily evaluated numerically viz. Monte Carlo simulation.

Our main focus will be a more flexible family of mixed Poisson distributions of Kind II, where each Poisson process {N_i(t), t ∈ R_+} is randomly stopped at a different random variable T_i, leading to

    Y = (Y_1, ... , Y_d) = (N_1(T_1), ... , N_d(T_d)),  (13)

where the random vector T = (T_1, ... , T_d) follows a multivariate distribution on R_+^d.
A particular special case of this construction, with the $\{T_i\}$ having correlated log-normal distributions, was recently proposed in Madsen and Dalthorp (2007), where this model was referred to as the lognormal-Poisson hierarchy (L-P model). While that particular model does not allow explicit forms for the marginal PMFs, it proved useful for applications. Our generalization, which we shall refer to as the T-Poisson hierarchy, allows $\mathbf{T}$ in (13) to have any continuous distribution on $\mathbb{R}_+^d$, with margins not necessarily belonging to the same parametric family. As will be seen in the sequel, the joint PMF of this more general model can still be written as in (10), with an appropriate function $h$. In particular, we shall work with families of distributions of $\mathbf{T}$ described by marginal CDFs $\{F_i\}$ and a copula function $C(u_1, \ldots, u_d)$. In this set-up, the PMF of $\mathbf{Y}$, which is still of the form (10), can be expressed as

$$P(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^d g_i(y_i)\, E\left\{c(F_1(X_1), \ldots, F_d(X_d))\right\}, \quad \mathbf{y} \in \mathbb{N}_0^d, \quad (14)$$

where the $g_i$ are the marginal PMFs of the $\{Y_i\}$, the function $c(u_1, \ldots, u_d)$ is the PDF corresponding to the copula $C(u_1, \ldots, u_d)$, and the $\{X_i\}$ are independent random variables with certain distributions dependent on the $\{y_i\}$. This expression, which is an analogue of (2) for discrete multivariate distributions defined through our scheme, provides a convenient way of computing probabilities of these multivariate distributions. This computational aspect of our construction compares favorably with a cumbersome formula for the PMF (see, e.g., Proposition 1.1 in Nikoloulopoulos and Karlis (2009)) of the competing method defined via (3). In what follows, we explore these ideas to provide a flexible multivariate modeling framework for dependent count data, emphasizing computationally convenient expressions and scalable algorithms for high-dimensional applications.
We begin by showing how multivariate count data can be generated as mixtures of Poisson distributions by developing sequences of independent Poisson processes randomly stopped at an underlying continuous real-valued random variable $T$ (a T-Poisson hierarchy). Then we show how our T-Poisson hierarchy scheme gives rise to computationally convenient joint probability mass functions (PMFs) and how particular choices of parameters/distributions can be used to construct well-known models such as the multivariate negative binomial. Next, we describe a scalable simulation algorithm using our construction and copula theory. Two examples are provided: a basic example producing a multivariate geometric distribution and an elaborate high-dimensional simulation study, aiming to model and simulate RNA-sequencing data. We note that our modeling framework and computationally convenient formulas may facilitate novel data analysis strategies, but we do not take up that task in the current study. We conclude with an Appendix containing selected proofs of assertions made throughout.

2 Multivariate mixtures of Poisson distributions

Our goal is to produce a random vector $\mathbf{Y} = (Y_1, \ldots, Y_d)$ with correlated mixed Poisson components. To this end, we start with a sequence of independent Poisson processes $\{N_i(t), t \in \mathbb{R}_+\}$, $i = 1, \ldots, d$, where the rate of the process $N_i(t)$ is $\lambda_i$. Next, we let $\mathbf{T} = (T_1, \ldots, T_d)$ have a multivariate distribution on $\mathbb{R}_+^d$ with the PDF $f_\mathbf{T}(\mathbf{t})$. Then, we define

$$\mathbf{Y} = (Y_1, \ldots, Y_d) = (N_1(T_1), \ldots, N_d(T_d)). \quad (15)$$

In the terminology of Karlis and Xekalaki (2005), this is a special case of multivariate mixed Poisson distributions of Type II. Assuming that the $\{N_i(t)\}$ are independent of $\mathbf{T}$, by standard conditioning arguments (see Lemma 7 in the Appendix) we obtain

$$P(\mathbf{Y} = \mathbf{y}) = \int_{\mathbb{R}_+^d} \prod_{i=1}^d \frac{\lambda_i^{y_i}}{y_i!}\; e^{-\sum_{i=1}^d \lambda_i t_i} \prod_{i=1}^d t_i^{y_i}\, f_\mathbf{T}(\mathbf{t})\, d\mathbf{t}. \quad (16)$$
While in some cases one can obtain explicit expressions for the above joint probabilities, in general these have to be calculated numerically. The calculations can be facilitated by certain representations of these probabilities, discussed in the Appendix (see Lemmas 7 and 8). This procedure is quite general, and leads to a multitude of multivariate discrete distributions. Flexible models allowing for marginal distributions of different types can be obtained by the popular approach based on copulas. Assume that $\mathbf{T}$ has a continuous distribution on $\mathbb{R}_+^d$ with marginal PDFs $f_i$ and CDFs $F_i$, driven by a particular copula $C(u_1, \ldots, u_d)$, so that the joint CDF of the $\{T_i\}$ is given by

$$F_\mathbf{T}(\mathbf{t}) = P(T_1 \le t_1, \ldots, T_d \le t_d) = C(F_1(t_1), \ldots, F_d(t_d)), \quad \mathbf{t} = (t_1, \ldots, t_d) \in \mathbb{R}_+^d.$$

Then, according to (2), the joint PDF $f_\mathbf{T}$ is of the form

$$f_\mathbf{T}(\mathbf{t}) = \left\{\prod_{i=1}^d f_i(t_i)\right\} c(F_1(t_1), \ldots, F_d(t_d)), \quad \mathbf{t} = (t_1, \ldots, t_d) \in \mathbb{R}_+^d, \quad (17)$$

where the function $c(u_1, \ldots, u_d)$ is the PDF corresponding to the copula CDF $C(u_1, \ldots, u_d)$. When we substitute (17) into (16), we get

$$P(\mathbf{Y} = \mathbf{y}) = \int_{\mathbb{R}_+^d} \prod_{i=1}^d \frac{\lambda_i^{y_i}}{y_i!}\; e^{-\sum_{i=1}^d \lambda_i t_i} \prod_{i=1}^d t_i^{y_i} f_i(t_i)\; c(F_1(t_1), \ldots, F_d(t_d))\, d\mathbf{t}. \quad (18)$$

Using the results presented in the Appendix (see Lemma 7), one can show that the marginal PMFs of the $\{Y_i\}$ are given by

$$P(Y_i = y) = \frac{\lambda_i^y}{y!}\, E\left(T_i^y e^{-\lambda_i T_i}\right) = E\left\{f_{\lambda_i T_i}(W)\right\}, \quad (19)$$

where $f_{\lambda_i T_i}(\cdot)$ is the PDF of $\lambda_i T_i$ and $W$ has a standard gamma distribution with shape parameter $y + 1$. With this notation, we can write (18) in the form

$$P(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^d P(Y_i = y_i) \int_{\mathbb{R}_+^d} c(F_1(t_1), \ldots, F_d(t_d))\, g(\mathbf{t}|\mathbf{y})\, d\mathbf{t}, \quad (20)$$

where the quantity $g(\mathbf{t}|\mathbf{y})$ in the above integral is the joint PDF of a multivariate distribution with independent margins,

$$g(\mathbf{t}|\mathbf{y}) = \prod_{i=1}^d g_i(t_i|y_i) \quad (21)$$

with

$$g_i(t|y) = \frac{t^y e^{-\lambda_i t} f_i(t)}{E\left(T_i^y e^{-\lambda_i T_i}\right)}, \quad t \in \mathbb{R}_+. \quad (22)$$
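The first equality in Eq. (19) can be evaluated by plain Monte Carlo whenever $T_i$ is easy to simulate. The sketch below, with our own hypothetical helper names, estimates $P(Y_i = y) = (\lambda^y/y!)\, E(T^y e^{-\lambda T})$ for a standard gamma $T$ and checks it against the exact NB probability of Eq. (7)/(30).

```python
import math
import random

def marginal_pmf_mc(y, lam, draw_T, n, rng):
    """Monte Carlo estimate of P(Y = y) = (lam^y / y!) * E[T^y * exp(-lam*T)], Eq. (19)."""
    s = 0.0
    for _ in range(n):
        t = draw_T(rng)
        s += t**y * math.exp(-lam * t)
    return lam**y / math.factorial(y) * (s / n)

def nb_pmf(y, r, p):
    # exact check: Gamma(y+r) / (Gamma(r) y!) * p^r * (1-p)^y, as in Eq. (7)
    return math.gamma(y + r) / (math.gamma(r) * math.factorial(y)) * p**r * (1 - p)**y

rng = random.Random(1)
r, lam = 2.0, 1.5
p = 1.0 / (1.0 + lam)
est = marginal_pmf_mc(3, lam, lambda g: g.gammavariate(r, 1.0), 200000, rng)
exact = nb_pmf(3, r, p)
```

With 200,000 draws the estimate agrees with the exact NB value to roughly three decimal places.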
Thus, the integral in (20) can be expressed as

$$\int_{\mathbb{R}_+^d} c(F_1(t_1), \ldots, F_d(t_d))\, g(\mathbf{t}|\mathbf{y})\, d\mathbf{t} = E\left\{c(F_1(X_1), \ldots, F_d(X_d))\right\}, \quad (23)$$

where $\mathbf{X} = (X_1, \ldots, X_d)$ has a multivariate distribution with independent components, governed by the PDF specified by (21)-(22). This leads to the following result.

Proposition 1 In the above setting, the joint probabilities (18) admit the representation

$$P(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^d P(Y_i = y_i)\; E\left\{c(F_1(X_1), \ldots, F_d(X_d))\right\}, \quad \mathbf{y} = (y_1, \ldots, y_d) \in \mathbb{N}_0^d, \quad (24)$$

where the marginal probabilities are given by (19) and the PDF of $\mathbf{X} = (X_1, \ldots, X_d)$ is given by (21)-(22).

Let us note that the joint moments of $Y_1, \ldots, Y_d$ exist whenever their counterparts for $T_1, \ldots, T_d$ are finite, in which case they can be evaluated by standard conditioning arguments. In particular, the mean and the covariance matrix of $\mathbf{Y}$ are related to their counterparts connected with $\mathbf{T}$ in a simple way, specified by Lemma 9 in the Appendix. It follows that $E Y_i = \lambda_i E T_i$ and $\mathrm{Var}\, Y_i = \lambda_i E T_i + \lambda_i^2 \mathrm{Var}\, T_i$, so the distributions of the $\{Y_i\}$ are always over-dispersed. Moreover, we have

$$\mathrm{Cov}(Y_i, Y_j) = \lambda_i \lambda_j\, \mathrm{Cov}(T_i, T_j), \quad i \ne j,$$

so that the correlation coefficient of $Y_i$ and $Y_j$ (if it exists) is related to that of $T_i$ and $T_j$ as follows:

$$\rho_{Y_i, Y_j} = c_{i,j}\, \rho_{T_i, T_j}, \quad i \ne j, \quad (25)$$

where

$$c_{i,j} = \sqrt{\frac{\lambda_i}{\lambda_i + \frac{E(T_i)}{\mathrm{Var}(T_i)}}}\; \sqrt{\frac{\lambda_j}{\lambda_j + \frac{E(T_j)}{\mathrm{Var}(T_j)}}}, \quad i \ne j. \quad (26)$$

Remark 1 While in general the correlation can be positive as well as negative and admits the same range as its counterpart for $T_i$ and $T_j$, the range of possible correlations of $Y_i$ and $Y_j$ can be further restricted if the margins are fixed. The maximum and minimum correlations can be deduced from (25)-(26) and the range of correlation corresponding to the joint distribution of $T_i$ and $T_j$.
The latter is provided by the minimum and maximum correlations, corresponding to the lower and the upper Fréchet copulas,

$$C_L(u_1, u_2) = \max\{u_1 + u_2 - 1, 0\}, \quad C_U(u_1, u_2) = \min\{u_1, u_2\}, \quad u_1, u_2 \in [0,1]. \quad (27)$$

The upper bound for the correlation is obtained when the distribution of $(T_i, T_j)$ is driven by the upper Fréchet copula $C_U$ in (27), so that $T_i \stackrel{d}{=} F_i^{-1}(U)$ and $T_j \stackrel{d}{=} F_j^{-1}(U)$, where $U$ is standard uniform and $F_i(\cdot), F_j(\cdot)$ are the CDFs of $T_i, T_j$, respectively. Similarly, the lower bound for the correlation is obtained when the distribution of $(T_i, T_j)$ is driven by the lower Fréchet copula $C_L$ in (27), where we have $T_i \stackrel{d}{=} F_i^{-1}(U)$ and $T_j \stackrel{d}{=} F_j^{-1}(1 - U)$. While these correlation bounds are usually not available explicitly, they can be easily obtained by Monte Carlo approximations via simulation from these (degenerate) probability distributions or by other standard approximate methods (see, e.g., Demirtas and Hedeker (2011), and references therein).

Remark 2 We note that when a bivariate random vector $\mathbf{Y} = (Y_1, Y_2)$ is defined via (15) and the distribution of the corresponding $\mathbf{T} = (T_1, T_2)$ is driven by one of the copulas in (27), then the distribution of $\mathbf{T}$ is not absolutely continuous and the above derivations leading to the PMF of $\mathbf{Y}$ need a modification. It can be shown that in this case the marginal distributions of the $Y_i$ are still given by (19), while the joint PMF of $(Y_1, Y_2)$ is also as in (20) with $d = 2$, but with the integral term replaced with

$$\int_0^1 g_1(u|y_1)\, g_2(u|y_2)\, du \quad \text{and} \quad \int_0^1 g_1(u|y_1)\, g_2(1 - u|y_2)\, du \quad (28)$$

under the upper and the lower Fréchet copula cases, respectively, where the $g_i(\cdot|y)$ in (28) are PDFs on $(0, 1)$ given by

$$g_i(u|y) = \frac{\left[F_i^{-1}(u)\right]^y e^{-\lambda_i F_i^{-1}(u)}}{E\left(T_i^y e^{-\lambda_i T_i}\right)}, \quad u \in (0, 1),\; y \in \mathbb{N}_0,\; i = 1, 2. \quad (29)$$
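The correlation bounds of Remarks 1-2 are straightforward to approximate by simulation: draw a single uniform $U$, map it through the marginal quantile functions comonotonically (upper Fréchet) or antithetically (lower Fréchet), and then generate the conditionally Poisson counts. A minimal sketch with standard exponential margins follows; the function names are ours, and the clipping of $U$ away from 0 and 1 is a numerical guard, not part of the theory.

```python
import math
import random

def rpois(mu, rng):
    # Knuth's method; fine for small means
    L = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def frechet_corr(lam1, lam2, upper, n, rng):
    """Approximate corr(Y1, Y2) under the upper (comonotone) or lower (antithetic)
    Frechet copula of Eq. (27), with standard exponential T_i."""
    ys = []
    for _ in range(n):
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # guard against log(0)
        t1 = -math.log(1 - u)                          # F^{-1}(u) for Exp(1)
        t2 = -math.log(1 - u) if upper else -math.log(u)
        ys.append((rpois(lam1 * t1, rng), rpois(lam2 * t2, rng)))
    m1 = sum(a for a, _ in ys) / n
    m2 = sum(b for _, b in ys) / n
    cov = sum((a - m1) * (b - m2) for a, b in ys) / n
    s1 = math.sqrt(sum((a - m1) ** 2 for a, _ in ys) / n)
    s2 = math.sqrt(sum((b - m2) ** 2 for _, b in ys) / n)
    return cov / (s1 * s2)

rng = random.Random(5)
hi = frechet_corr(1.0, 1.0, True, 20000, rng)
lo = frechet_corr(1.0, 1.0, False, 20000, rng)
```

For $\lambda_1 = \lambda_2 = 1$ the scaling factor in (26) is $1/2$, so the upper bound sits near $0.5$ and the lower bound is strictly above $-0.5$.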
Again, while the integrals in (28) are rarely available explicitly, they can be easily approximated by Monte Carlo simulations in order to compute the joint PMF of $\mathbf{Y} = (Y_1, Y_2)$. These two "extreme" distributional cases can also be used to derive the full range of values for the correlation of $\mathbf{Y} = (Y_1, Y_2)$ when the marginal distributions (19) are fixed, if needed.

2.1 Mixed Poisson distributions with NB margins

We now consider the case where the mixed Poisson marginal distributions of $\mathbf{Y}$ are NB, so that the marginal distributions of $\mathbf{T}$ are gamma (see Lemma 1 in the Appendix). Thus, we shall assume that the coordinates of the random vector $\mathbf{T}$ have univariate standard gamma distributions with shape parameters $r_i \in \mathbb{R}_+$, $i = 1, \ldots, d$. There have been numerous multivariate gamma distributions developed over the years, and we could use any of them here. However, we follow the general approach based on copulas, discussed above. Thus, we assume that the dependence structure of $\mathbf{T}$ is governed by some copula function $C(u_1, \ldots, u_d)$, which admits the PDF $c(u_1, \ldots, u_d)$. In this case, the $f_i$ in (18) are given by (6) where $r = r_i$, and the $F_i$ are the corresponding CDFs. Here, the marginal PMFs of the $\{Y_i\}$ in (19) are given by

$$P(Y_i = y) = \frac{\Gamma(y + r_i)}{\Gamma(r_i)\, y!}\, p_i^{r_i} (1 - p_i)^y, \quad y \in \mathbb{N}_0, \quad (30)$$

where the NB probabilities are given by $p_i = 1/(1 + \lambda_i) \in (0, 1)$ (so that $\lambda_i = (1 - p_i)/p_i > 0$). Further, the PDF of $\mathbf{X}$ in Proposition 1 is still given by (21), where the marginal PDFs $g_i(\cdot|y_i)$ now admit the explicit expressions

$$g_i(t|y_i) = \frac{(1 + \lambda_i)^{y_i + r_i}}{\Gamma(y_i + r_i)}\, t^{y_i + r_i - 1} e^{-(1 + \lambda_i)t}, \quad t \in \mathbb{R}_+. \quad (31)$$

We recognize that these are gamma PDFs. Thus, in this special case of multivariate mixed Poisson distributions of Type II with NB marginal distributions, the random vector $\mathbf{X}$ in the representation (14) has a multivariate gamma distribution as well, but with independent margins. This fact is summarized in the result below.
Corollary 1 Let $\mathbf{Y}$ have a mixed Poisson distribution defined via (15), where the $\{N_i(\cdot)\}$ are independent Poisson processes with respective rates $\lambda_i$ and $\mathbf{T}$ has a multivariate gamma distribution with standard gamma margins with shape parameters $r_i$ and CDFs $F_i$, governed by a copula PDF $c(\mathbf{u})$. Then, the marginal PMFs of $\mathbf{Y}$ are given by (30) with $p_i = 1/(1 + \lambda_i) \in (0, 1)$ and its joint PMF is given by (14), where $\mathbf{X} = (X_1, \ldots, X_d)$ has a multivariate gamma distribution with independent gamma marginal distributions of the $\{X_i\}$ with PDFs given by (31).

Remark 3 If the expectation in (14) does not admit an explicit form in terms of $y_1, \ldots, y_d$, one can approximate its value via a straightforward Monte Carlo approximation involving random variate generation of independent gamma random variates $\{X_i\}$.

Let us note that since the $\{T_i\}$ have standard gamma distributions with shape parameters $r_i$, we have $E(T_i) = \mathrm{Var}(T_i) = r_i$, and an application of Lemma 9 leads to the following result.

Proposition 2 Let $\mathbf{Y}$ have a mixed Poisson distribution defined via (15), where the $\{N_i(\cdot)\}$ are independent Poisson processes with respective rates $\lambda_i$ and $\mathbf{T}$ has a multivariate gamma distribution with standard gamma margins with shape parameters $r_i$ and CDFs $F_i$, governed by a copula PDF $c(\mathbf{u})$. Then, $E(\mathbf{Y}) = I(\boldsymbol{\lambda})\mathbf{r}$, where $\mathbf{r} = (r_1, \ldots, r_d)$ and $I(\boldsymbol{\lambda})$ is a $d \times d$ diagonal matrix with the $\{\lambda_i\}$ on the main diagonal. Moreover, the covariance matrix of $\mathbf{Y}$ is given by

$$\Sigma_\mathbf{Y} = I(\boldsymbol{\lambda}) I(\mathbf{r}) + I(\boldsymbol{\lambda})\, \Sigma_\mathbf{T}\, I(\boldsymbol{\lambda}),$$

where $\Sigma_\mathbf{T}$ is the covariance matrix of $\mathbf{T}$ and $I(\mathbf{r})$ is a $d \times d$ diagonal matrix with the $\{r_i\}$ on the main diagonal.

Remark 4 The correlation of $Y_i$ and $Y_j$ is still given by (25), where this time

$$c_{i,j} = \sqrt{\frac{\lambda_i}{1 + \lambda_i}}\, \sqrt{\frac{\lambda_j}{1 + \lambda_j}}, \quad i \ne j,$$

since in (26) we have $E(T_i) = \mathrm{Var}(T_i)$.
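The Monte Carlo strategy of Remark 3 can be sketched concretely for $d = 2$ with a Gaussian copula and exponential margins ($r_i = 1$, hence geometric counts). We average the bivariate Gaussian copula density (Eq. (32) with $d = 2$) over independent gamma draws from (31). All function names here are our own; with $\rho = 0$ the copula density is identically 1, so the estimate collapses to the product of the marginal PMFs, which gives a built-in check.

```python
import math
import random
from statistics import NormalDist

def gauss_copula_pdf2(u1, u2, rho):
    """Bivariate Gaussian copula density, Eq. (32) with d = 2."""
    z1, z2 = NormalDist().inv_cdf(u1), NormalDist().inv_cdf(u2)
    q = (rho * rho * (z1 * z1 + z2 * z2) - 2 * rho * z1 * z2) / (1 - rho * rho)
    return math.exp(-0.5 * q) / math.sqrt(1 - rho * rho)

def joint_pmf_mc(y1, y2, lam1, lam2, rho, n, rng):
    """Estimate P(Y1=y1, Y2=y2) via Proposition 1 / Corollary 1, exponential margins:
    average the copula density at (F1(X1), F2(X2)), X_i ~ Gamma(y_i + 1, rate 1 + lam_i)."""
    p1, p2 = 1 / (1 + lam1), 1 / (1 + lam2)
    marg = p1 * (1 - p1) ** y1 * p2 * (1 - p2) ** y2   # product of Eq. (30) with r_i = 1
    s = 0.0
    for _ in range(n):
        x1 = rng.gammavariate(y1 + 1, 1 / (1 + lam1))
        x2 = rng.gammavariate(y2 + 1, 1 / (1 + lam2))
        s += gauss_copula_pdf2(1 - math.exp(-x1), 1 - math.exp(-x2), rho)
    return marg * s / n

rng = random.Random(3)
est_indep = joint_pmf_mc(0, 0, 1.0, 1.0, 0.0, 2000, rng)   # rho = 0: equals p1 * p2
est_dep = joint_pmf_mc(0, 0, 1.0, 1.0, 0.6, 20000, rng)
```

The same pattern extends to general $r_i$ by swapping in the gamma CDFs for the exponential ones.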
Let us note that while in principle the quantities $c_{i,j}$ can assume any value in $(0, 1)$ when we choose appropriate $\lambda_i$ and $\lambda_j$, they are fixed for particular marginal NB distributions, since in this model the NB probabilities are given by $p_i = 1/(1 + \lambda_i)$. In terms of the latter, we have

$$c_{i,j} = \sqrt{1 - p_i}\, \sqrt{1 - p_j}, \quad i \ne j.$$

These quantities, along with the full range of correlations for $\rho_{T_i, T_j}$ in (25), can be used to obtain the upper and lower bounds for possible correlations of $Y_i$ and $Y_j$ in this model. We note that the possible range of $\rho_{T_i, T_j}$ depends on the shape parameters $r_i$ and $r_j$. If the $\{T_i\}$ are exponential (so that $r_i = r_j = 1$), then the upper limit of their correlation can be shown to be 1. However, the full range for the correlation of $T_i$ and $T_j$ is usually a subset of $[-1, 1]$, which can be approximated by Monte Carlo simulations (see Remarks 1-2) or other approximate methods (see, e.g., Demirtas and Hedeker (2011)).

2.2 Simulation

One particular way of defining this model, convenient for simulations, is by using the Gaussian copula to generate $\mathbf{T}$. This is a very popular methodology due to its flexibility and the ease of simulating from a required multivariate normal distribution. The Gaussian copula is one that corresponds to a multivariate normal distribution with standard normal marginal distributions and covariance matrix $R$. Since the marginals are standard normal, this $R$ is also the correlation matrix. If $F_R$ is the CDF of such a multivariate normal distribution, then the corresponding Gaussian copula $C_R$ is defined through

$$F_R(x_1, \ldots, x_d) = C_R(\Phi(x_1), \ldots, \Phi(x_d)),$$

where $\Phi(\cdot)$ is the standard normal CDF. Note that the copula $C_R$ is simply the CDF of the random vector $(\Phi(X_1), \ldots, \Phi(X_d))$, where $(X_1, \ldots, X_d) \sim N_d(\mathbf{0}, R)$.
If the distribution is continuous (so that $R$ is non-singular), the copula $C_R$ admits the PDF $c_R$, given by

$$c_R(u_1, \ldots, u_d) = \frac{1}{|R|^{1/2}}\; e^{-\frac{1}{2}\left(\Phi^{-1}(\mathbf{u})\right)^\top \left(R^{-1} - I_d\right) \Phi^{-1}(\mathbf{u})}, \quad \mathbf{u} = (u_1, \ldots, u_d) \in [0,1]^d, \quad (32)$$

where $\Phi^{-1}(\mathbf{u}) = (\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d))$ and $I_d$ is the $d \times d$ identity matrix. This $c_R$ will then be used in equations (20), (23), and (14). Simulation of multivariate gamma $\mathbf{T}$ with margins $F_i$ based on this copula is quite simple, and involves the following steps:

(i) Generate $\mathbf{X} = (X_1, \ldots, X_d) \sim N_d(\mathbf{0}, R)$;
(ii) Transform $\mathbf{X}$ to $\mathbf{U} = (U_1, \ldots, U_d)$ via $U_i = \Phi(X_i)$, $i = 1, \ldots, d$;
(iii) Return $\mathbf{T} = (T_1, \ldots, T_d)$, where $T_i = F_i^{-1}(U_i)$, $i = 1, \ldots, d$.

Remark 5 This strategy of using the Gaussian copula to generate multivariate distributions is quite popular indeed, and it has become known in the literature as the NORTA (NORmal To Anything) method (see, e.g., Chen (2001); Song and Hsiao (1993)). This methodology has recently been used to generate multivariate discrete distributions; see, e.g., Barbiero and Ferrari (2017), Madsen and Birkes (2013), or Nikoloulopoulos (2013) and references therein. The standard approach discussed in these papers proceeds by simulating the vector $\mathbf{U}$ from the Gaussian copula following steps (i)-(ii) above and then transforming the coordinates of $\mathbf{U}$ directly via the inverse CDFs of the components of the target random vector $\mathbf{Y} = (Y_1, \ldots, Y_d)$, which can be described as

(iii)' Return $\mathbf{Y} = (Y_1, \ldots, Y_d)$, where $Y_i = G_i^{-1}(U_i)$, $i = 1, \ldots, d$.

Here, the $G_i$ are the CDFs of the $Y_i$. If the distributions of the $Y_i$ are discrete (such as NB), the inverse CDF is defined in the standard way as

$$G^{-1}(u) = \inf\{y : G(y) \ge u\}.$$

The difference between our approach and the one discussed in the literature as described above is in the final step, regardless of the particular copula $c$ that is used. In the standard approach one first simulates random $\mathbf{U}$ from $c$ and then proceeds via
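Steps (i)-(iii) can be sketched as follows for $d = 2$ with standard exponential margins, where the quantile function is available in closed form ($F^{-1}(u) = -\log(1-u)$); for general gamma margins one would substitute a gamma quantile routine. The function name and the one-uniform construction of the correlated normal pair are our own choices.

```python
import math
import random
from statistics import NormalDist

def norta_exponential_pair(rho, n, rng):
    """Steps (i)-(iii) for d = 2 with standard exponential margins;
    rho is the off-diagonal entry of the Gaussian copula correlation matrix R."""
    phi = NormalDist()
    out = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)                                # (i) X ~ N_2(0, R)
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        u1, u2 = phi.cdf(z1), phi.cdf(z2)                   # (ii) U_i = Phi(X_i)
        out.append((-math.log(1 - u1), -math.log(1 - u2)))  # (iii) T_i = F_i^{-1}(U_i)
    return out

rng = random.Random(11)
pairs = norta_exponential_pair(0.7, 20000, rng)
m1 = sum(a for a, _ in pairs) / len(pairs)   # should be near E(T_1) = 1
m2 = sum(b for _, b in pairs) / len(pairs)
cov = sum((a - m1) * (b - m2) for a, b in pairs) / len(pairs)
```

The margins come out standard exponential by construction, and the positive copula correlation induces a positive (though slightly attenuated) Pearson covariance between $T_1$ and $T_2$.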
(iii)' above to get the target random vector $\mathbf{Y}$ (having a multivariate distribution with CDFs $G_i$). On the other hand, our proposal is to first generate $\mathbf{T}$ via step (iii) above and then obtain the target variable via (15). While our methodology involves an extra step compared with this direct method, it offers a simple way of calculating the joint probabilities, which is not available in the other approach. Additionally, our methodology offers a stochastic explanation of the resulting distributions via the mixing mechanism and its relation to the underlying Poisson processes, which is lacking in the somewhat artificial standard approach. Another advantage of the mixed Poisson approach is the possibility of extensions to more general stochastic processes in the spirit of the NB process studied by Kozubowski and Podgórski (2009). On the other hand, its disadvantage is the fact that not all discrete marginal distributions can be obtained, only those that are mixed Poisson to begin with.

Remark 6 Let us note that the mixed Poisson approach to generating multivariate distributions was used in Madsen and Dalthorp (2007), where $\mathbf{Y}$ was obtained via (15) with standard Poisson processes and where $\mathbf{T} = e^{\mathbf{X}}$, with $\mathbf{X}$ being multivariate normal with mean $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)$ and covariance matrix $\Sigma = [\sigma_{i,j}]$. Since in this case the marginals of $\mathbf{T}$ have log-normal distributions, the authors referred to this construction as the lognormal-Poisson hierarchy. This can be seen as a special case of our scheme, where we have $\lambda_i = e^{\mu_i}$ and the marginal CDFs of the $T_i$ of the form $F_i(t) = \Phi\left(\log t / \sigma_{ii}^{1/2}\right)$. The copula PDF of the $\{T_i\}$ is the Gaussian copula (32), where $R$ is the correlation matrix corresponding to $\Sigma$.
An important aspect of this problem is how to set the parameters of the underlying copula function so that the distribution of $\mathbf{Y}$ has given characteristics, such as the means and the covariances (and correlations). In the case where a Gaussian copula is used, this amounts to determining the correlation matrix $R$. This problem arises in the general scheme (i)-(iii) as well, and has been discussed in the literature (see, e.g., Barbiero and Ferrari (2017); Xiao (2017); Xiao and Zhou (2019)). Generally, there is no simple relation between $R$ and the correlation matrix of $\mathbf{T}$ in (i)-(iii). However, other measures of association, such as Kendall's $\tau$ or Spearman's $\rho$, do transfer directly and may be preferred in our set-up. These issues will be the subject of further research.

3 Examples

We provide two examples. The first example describes the T-Poisson hierarchy approach to constructing a multivariate geometric distribution. Second, we demonstrate how the T-Poisson hierarchy can be used to conduct a high-dimensional ($d = 1026$) simulation study inspired by RNA-sequencing data, a challenging computational task.

3.1 Multivariate geometric distributions

Suppose that the random vector $\mathbf{T}$ in (15) has marginal standard exponential distributions, so that the marginal CDFs of the $\{T_i\}$ are of the form

$$F_i(t) = 1 - e^{-t}, \quad t \in \mathbb{R}_+. \quad (33)$$

In this case, the $\{Y_i\}$ have geometric distributions with parameters $p_i = 1/(1 + \lambda_i)$, so that

$$P(Y_i = y) = p_i (1 - p_i)^y, \quad y \in \mathbb{N}_0. \quad (34)$$

One can then obtain a multitude of multivariate distributions with geometric margins by selecting various copulas for the underlying distribution of $\mathbf{T}$. As an example, consider the case with the Farlie-Gumbel-Morgenstern (FGM) copula driven by a parameter $\theta \in [-1, 1]$, given by

$$C(\mathbf{u}) = \prod_{i=1}^d u_i \left(1 + \theta \prod_{i=1}^d (1 - u_i)\right), \quad \mathbf{u} = (u_1, \ldots, u_d) \in [0,1]^d. \quad (35)$$
Consider the two-dimensional case with $d = 2$, where the PDF corresponding to (35) is of the form

$$c(\mathbf{u}) = 1 + \theta(1 - 2u_1)(1 - 2u_2), \quad \mathbf{u} = (u_1, u_2) \in [0,1]^2. \quad (36)$$

In this case, the random vector $\mathbf{X} = (X_1, X_2)$ in Corollary 1 has independent gamma margins (31) with shape parameters $y_i + 1$ and scale parameters $1 + \lambda_i$, $i = 1, 2$. Using this fact, coupled with (33), one can evaluate the expectation in (14), leading to

$$E\left\{c(F_1(X_1), F_2(X_2))\right\} = 1 + \theta \left(1 - 2\left(\frac{1}{1 + p_1}\right)^{y_1 + 1}\right)\left(1 - 2\left(\frac{1}{1 + p_2}\right)^{y_2 + 1}\right). \quad (37)$$

In view of Corollary 1, this leads to the following expression for the joint probabilities of the bivariate geometric distribution defined by our scheme via the FGM copula:

$$P(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^2 p_i (1 - p_i)^{y_i} \left[1 + \theta \prod_{i=1}^2 \left(1 - 2\left(\frac{1}{1 + p_i}\right)^{y_i + 1}\right)\right], \quad \mathbf{y} = (y_1, y_2) \in \mathbb{N}_0^2. \quad (38)$$

We shall denote this distribution by $GEO(p_1, p_2, \theta)$. When $\theta = 0$, the $\{Y_i\}$ are independent geometric variables with parameters $p_i \in (0, 1)$, $i = 1, 2$. Otherwise, $Y_1, Y_2$ are correlated, with

$$\mathrm{Cov}(Y_1, Y_2) = \frac{\theta}{4}\, \frac{(1 - p_1)(1 - p_2)}{p_1 p_2}, \quad (39)$$

as can be verified by routine, albeit tedious, algebra. In turn, the correlation of $Y_1, Y_2$ becomes

$$\rho_{Y_1, Y_2} = \frac{\theta}{4}\, \sqrt{1 - p_1}\, \sqrt{1 - p_2}, \quad (40)$$

and can generally take any value in the range $(-1/4, 1/4)$.

3.2 Simulating RNA-seq data

This section describes how to simulate data using a T-Poisson hierarchy, aiming to replicate the structure of high-dimensional dependent count data. In fact, simulating RNA-sequencing (RNA-seq) data is one of the primary motivating applications of the proposed methodology, which seeks scalable Monte Carlo methods for realistic multivariate simulation (for example, see Schissler et al. (2018)). The RNA-seq data generating process involves counting how often a particular messenger RNA (mRNA) is expressed in a biological sample. Since this is a counting process with no upper bound, many modeling approaches use discrete random variables with infinite support.
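The closed-form PMF (38) makes the $GEO(p_1, p_2, \theta)$ example fully checkable by direct summation: the joint probabilities should sum to 1, and the Pearson correlation computed from the PMF should match Eq. (40). A short deterministic sketch (function name ours; the grid truncation at $N = 60$ makes the tail error negligible for these parameters):

```python
import math

def geo_fgm_pmf(y1, y2, p1, p2, theta):
    """Joint PMF of GEO(p1, p2, theta), Eq. (38)."""
    base = p1 * (1 - p1) ** y1 * p2 * (1 - p2) ** y2
    f1 = 1 - 2 * (1 / (1 + p1)) ** (y1 + 1)
    f2 = 1 - 2 * (1 / (1 + p2)) ** (y2 + 1)
    return base * (1 + theta * f1 * f2)

p1, p2, theta = 0.4, 0.6, 0.8
N = 60  # truncation; geometric tails beyond N are ~1e-13 here
grid = [(i, j) for i in range(N) for j in range(N)]
total = sum(geo_fgm_pmf(i, j, p1, p2, theta) for i, j in grid)
m1 = sum(i * geo_fgm_pmf(i, j, p1, p2, theta) for i, j in grid)
m2 = sum(j * geo_fgm_pmf(i, j, p1, p2, theta) for i, j in grid)
m12 = sum(i * j * geo_fgm_pmf(i, j, p1, p2, theta) for i, j in grid)
cov = m12 - m1 * m2
rho = cov / math.sqrt((1 - p1) / p1**2 * (1 - p2) / p2**2)
rho_formula = theta / 4 * math.sqrt(1 - p1) * math.sqrt(1 - p2)  # Eq. (40)
```

Both checks hold to high precision, confirming the algebra behind (38)-(40).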
Often the counts exhibit over-dispersion, and so the negative binomial arises as a sensible model for the expression levels (gene counts). Moreover, the counts are correlated (co-expressed) and cannot be assumed to behave independently. RNA-seq platforms quantify the entire transcriptome in one experimental run, resulting in high-dimensional data. In humans, this results in count data corresponding to over 20,000 genes (coding genomic regions), or even over 77,000 isoforms when alternatively spliced mRNA are counted. This suggests that simulating high-dimensional multivariate NB vectors with heterogeneous marginals would be a useful tool in the development and evaluation of RNA-seq analytics.

In an illustration of our proposed methodology applied to real data, we seek to simulate RNA-sequencing data by producing simulated random vectors generated from the Type II T-Poisson framework (as in Eq. (13)). Our goal is to replicate the structure of a breast cancer data set (BRCA: breast cancer invasive carcinoma data set from The Cancer Genome Atlas). For simplicity, we begin by filtering to retain the top 5% highest-expressing genes of the 20,501 gene measurements from $N = 1212$ patients' tumor samples, resulting in $d = 1026$ genes. All these genes exhibit over-dispersion, and so we proceed to estimate the NB parameters $(r_i, p_i)$, $i = 1, \ldots, d$, to determine the target marginal PMFs $g_i(y_i)$ (via the method of moments). Notably, the $p_i$'s are small, ranging in $[3.934 \times 10^{-6}, 1.217 \times 10^{-2}]$. To complete the simulation algorithm inputs, we estimate the Pearson correlation matrix $R_\mathbf{Y}$ and set that as the target correlation. With the simulation targets specified, we proceed to simulate $B = 10{,}000$ random vectors $\mathbf{Y} = (Y_1, \ldots, Y_d)$ with target Pearson correlation $R_\mathbf{Y}$ and marginal PMFs $g_i(y_i)$ using a T-Poisson hierarchy of Kind II.
Specifically, we first employ the direct Gaussian copula approach to generate $B$ random vectors following a standard multivariate gamma distribution $\mathbf{T}$ with shape parameters $r_i$ equal to the target NB sizes and Pearson correlation matrix $R_\mathbf{T}$. Care must be taken when specifying $R$ (refer to Eq. (32)): we employ Eq. (25) to compute the scaling factors $c_{i,j}$ and adjust the underlying correlations to ultimately match the target $R_\mathbf{Y}$. Notably, of the 525,825 pairwise correlations from the 1026 genes, no scale factor was less than 0.9907, indicating the model can produce essentially the entire range of possible correlations. Here we are satisfied with approximate matching of the specified gamma correlation and set $R = R_\mathbf{Y}$ in our Gaussian copula scheme ($R$ indicating the specified multivariate Gaussian correlation matrix). Finally, we generate the desired random vector $\mathbf{Y}$ via $Y_i = N_i(T_i)$ by simulating Poisson counts with expected value $\mu_i = \lambda_i \times T_i$, for $i = 1, \ldots, d$ (with $\lambda_i = (1 - p_i)/p_i$), and repeat this $B = 10{,}000$ times.

Fig. 1 The T-Poisson strategy produces simulated random vectors from a multivariate negative binomial (NB) that replicate the estimated structure from an RNA-seq data set. The dashed red lines indicate equality between estimated parameters (vertical axes; derived from the simulated data) and the specified target parameters (horizontal axes)

Figure 1 shows the results of our simulation by comparing the specified target parameters (horizontal axes) with the corresponding quantities estimated from the simulated data (vertical axes). The evaluation shows that the simulated counts approximately match the target parameters and exhibit the full range of estimated correlations from the data. Utilizing 15 CPU threads on a MacBook Pro carrying a 2.4 GHz 8-Core Intel Core i9 processor, the simulation completed in less than 30 seconds.
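A toy, dependency-free version of this pipeline can illustrate the final T-Poisson step at small scale. Note the assumptions: we use only $d = 3$ "genes" with modest NB targets, and we induce dependence among the gamma rates with a common-shock construction ($T_i = G_0 + G_i$, exploiting shape additivity, one of the many multivariate gamma constructions mentioned in Section 2.1) rather than the Gaussian copula used in the paper's actual study. Function names are ours.

```python
import math
import random

def rpois(mu, rng):
    # Knuth's method; fine for the small means in this toy example
    L = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_counts(r, p, a0, B, rng):
    """Toy Section-3.2 pipeline: correlated gamma rates T_i = G0 + G_i
    (shapes a0 and r[i]-a0, so T_i ~ Gamma(r[i], 1)), then Y_i ~ Poisson(lam_i*T_i)."""
    lam = [(1 - pi) / pi for pi in p]        # lambda_i = (1 - p_i)/p_i
    draws = []
    for _ in range(B):
        g0 = rng.gammavariate(a0, 1.0)       # shared shock inducing correlation
        t = [g0 + rng.gammavariate(ri - a0, 1.0) for ri in r]
        draws.append([rpois(li * ti, rng) for li, ti in zip(lam, t)])
    return draws

rng = random.Random(2021)
r = [2.0, 3.0, 1.5]
p = [0.5, 0.4, 0.5]
Y = simulate_counts(r, p, a0=1.0, B=20000, rng=rng)
means = [sum(row[i] for row in Y) / len(Y) for i in range(3)]
vars_ = [sum((row[i] - means[i]) ** 2 for row in Y) / len(Y) for i in range(3)]
target = [ri * (1 - pi) / pi for ri, pi in zip(r, p)]   # NB means r_i*(1-p_i)/p_i
```

The simulated margins should match the NB target means and be over-dispersed (variance exceeding the mean), mirroring the checks reported in Figure 1.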
Appendix

Gamma-Poisson mixtures

For the convenience of the reader, we include a short proof of the well-known fact stating that a Poisson distribution with gamma-distributed parameter is NB (see, e.g., Solomon (1983)).

Lemma 1 If $\{N(t), t \in \mathbb{R}_+\}$ is a homogeneous Poisson process with rate $\lambda = (1 - p)/p > 0$ and $T$ is an independent standard gamma variable with shape parameter $r$, then the randomly stopped process, $Y = N(T)$, has a NB distribution $NB(r, p)$ with the PMF (7).

Proof Suppose that $T$ has a standard gamma distribution with the PDF (6) and the corresponding CDF $F_T$. When we substitute the latter into (5), we obtain

$$P(Y = n) = \int_0^\infty \frac{e^{-\lambda t}(\lambda t)^n}{n!} \cdot \frac{1}{\Gamma(r)}\, t^{r-1} e^{-t}\, dt.$$

After some algebra, this produces

$$P(Y = n) = \frac{\Gamma(n + r)}{\Gamma(r)\, n!} \cdot \frac{\lambda^n}{(1 + \lambda)^{n+r}} \int_0^\infty \frac{(1 + \lambda)^{n+r}}{\Gamma(n + r)}\, t^{n+r-1} e^{-t(1+\lambda)}\, dt.$$

Since the integrand above is the PDF of a gamma distribution with shape $n + r$ and scale $1 + \lambda$, the integral becomes 1 and we have

$$P(Y = n) = \frac{\Gamma(n + r)}{\Gamma(r)\, n!} \left(\frac{1}{1 + \lambda}\right)^r \left(\frac{\lambda}{1 + \lambda}\right)^n,$$

which we recognize as the NB probability from (7) with $p = (1 + \lambda)^{-1}$. The result follows when we set $\lambda = (1 - p)/p$ in the above analysis.

Mixed multivariate Poisson distributions of Type I

Here we provide basic distributional facts about mixed multivariate Poisson distributions of Type I, which are the distributions of

$$\mathbf{Y} = (Y_1, \ldots, Y_d) = (N_1(T), \ldots, N_d(T)),$$

where the $\{N_i(\cdot)\}$ are independent Poisson processes with rates $\lambda_i$ and $T$ is a random variable on $\mathbb{R}_+$, independent of the $\{N_i\}$.

Lemma 2 In the above setting, the PGF of $\mathbf{Y}$ is given by

$$G(\mathbf{s}) = E\left\{\prod_{i=1}^d s_i^{Y_i}\right\} = \phi_T\left(\sum_{i=1}^d \lambda_i - \sum_{i=1}^d \lambda_i s_i\right), \quad \mathbf{s} = (s_1, \ldots, s_d) \in [0,1]^d,$$

where $\phi_T$ is the LT of $T$.

Proof By using a standard conditioning argument, we have
$$G(\mathbf{s}) = E\left\{\prod_{i=1}^d s_i^{Y_i}\right\} = \int_{\mathbb{R}_+} E\left\{\prod_{i=1}^d s_i^{Y_i} \,\Big|\, T = t\right\} dF_T(t). \quad (41)$$

Since given $T = t$ the variables $\{Y_i\}$ are independent and Poisson distributed with means $\{\lambda_i t\}$, respectively, we have

$$E\left\{\prod_{i=1}^d s_i^{Y_i} \,\Big|\, T = t\right\} = \prod_{i=1}^d E\left\{s_i^{Y_i} \,\Big|\, T = t\right\} = \prod_{i=1}^d e^{-\lambda_i t (1 - s_i)} = e^{-t\left(\sum_{i=1}^d \lambda_i - \sum_{i=1}^d \lambda_i s_i\right)}.$$

When we substitute the above into (41), we conclude that the PGF of $\mathbf{Y}$ is indeed of the form stated above.

Remark 7 Note that in the one-dimensional case $d = 1$, we recover the well-known formula for the PGF of $Y = N(T)$,

$$G(s) = \phi_T(\lambda(1 - s)), \quad s \in [0,1], \quad (42)$$

where $\lambda > 0$ is the rate of the Poisson process $\{N(t), t \in \mathbb{R}_+\}$. If we further assume that $T$ is standard gamma distributed with shape parameter $r > 0$, so that

$$\phi_T(t) = \left(\frac{1}{1 + t}\right)^r, \quad t \in \mathbb{R}_+,$$

and we take $\lambda = (1 - p)/p$, we obtain

$$G(s) = \left(\frac{p}{1 - (1 - p)s}\right)^r, \quad s \in [0,1]. \quad (43)$$

We recognize this as the PGF of the NB distribution $NB(r, p)$, as it should be according to Lemma 1. Similarly, the PGF of a $d$-dimensional mixed Poisson distribution with such a gamma distributed $T$ takes on the form

$$G(\mathbf{s}) = \left(\frac{1}{Q - \sum_{i=1}^d P_i s_i}\right)^r, \quad \mathbf{s} = (s_1, \ldots, s_d) \in [0,1]^d,$$

where $P_i = \lambda_i$ and $Q = 1 + \sum_{i=1}^d P_i$. This is the general form of the multivariate negative multinomial distribution (see Chapter 36 of Johnson et al. (1997)). Since the PGF of the marginal distributions of $Y_i$ in this setting is of the form (43) with $p_i = (1 + \lambda_i)^{-1}$, all marginal distributions are NB. Due to this property, discrete multivariate distributions with the above PGFs have been termed multivariate NB distributions (for more details, see Johnson et al. (1997)).

Remark 8 Let us note that changing the scale factor of the variable $T$ in this model has the same effect as adjusting the rate parameters connected with the Poisson processes $\{N_i(\cdot)\}$. Namely, it follows from Lemma 2 that if we let $\tilde{T} = cT$ in the above setting, then we have the following equality in distribution:

$$\left(N_1(\tilde{T}), \ldots, N_d(\tilde{T})\right) \stackrel{d}{=} \left(\tilde{N}_1(T), \ldots, \tilde{N}_d(T)\right), \quad (44)$$

where the $\{\tilde{N}_i(\cdot)\}$ are independent Poisson processes with rates $c\lambda_i$, respectively.
Thus, without loss of generality, we may assume that the scale parameter of the variable $T$ in this model is set to unity.

Lemma 3 In the above setting, the PMF of $\mathbf{Y}$ is given by

$$P(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^d g_i(y_i)\, h(\mathbf{y}), \quad \mathbf{y} = (y_1, \ldots, y_d) \in \mathbb{N}_0^d,$$

where

$$g_i(y) = \frac{\lambda_i^y}{y!}\, v_T(y, \lambda_i), \quad y \in \mathbb{N}_0,$$

are the marginal PMFs of the $\{Y_i\}$,

$$v_T(y, \lambda) = E\left(T^y e^{-\lambda T}\right), \quad \lambda, y \in \mathbb{R}_+,$$

and the function $h$ is given by

$$h(\mathbf{y}) = \frac{v_T\left(\sum_{i=1}^d y_i,\; \sum_{i=1}^d \lambda_i\right)}{\prod_{i=1}^d v_T(y_i, \lambda_i)}, \quad \mathbf{y} = (y_1, \ldots, y_d) \in \mathbb{N}_0^d.$$

Proof Since given $T = t$ the variables $\{Y_i\}$ are independent and Poisson distributed with means $\{\lambda_i t\}$, respectively, by using a standard conditioning argument, followed by some algebra, we have

$$P(\mathbf{Y} = \mathbf{y}) = \int_{\mathbb{R}_+} \prod_{i=1}^d \frac{(\lambda_i t)^{y_i}}{y_i!}\; e^{-t\sum_{i=1}^d \lambda_i}\, dF_T(t) = \left[\prod_{i=1}^d \frac{\lambda_i^{y_i}}{y_i!}\right] v_T\left(\sum_{i=1}^d y_i,\; \sum_{i=1}^d \lambda_i\right). \quad (45)$$

Similarly, the marginal PMFs are given by

$$P(Y_i = y) = \int_{\mathbb{R}_+} \frac{(\lambda_i t)^y}{y!}\; e^{-t\lambda_i}\, dF_T(t) = \frac{\lambda_i^y}{y!}\, v_T(y, \lambda_i). \quad (46)$$

By combining (45) and (46), we obtain the result.

Remark 9 Note that the joint PMF of $\mathbf{Y}$ can also be written as

$$P(\mathbf{Y} = \mathbf{y}) = \left[\prod_{i=1}^d \frac{\lambda_i^{y_i}}{y_i!}\right] v_T\left(\sum_{i=1}^d y_i,\; \sum_{i=1}^d \lambda_i\right), \quad (47)$$

which is a convenient expression for approximating these probabilities by Monte Carlo simulations if the function $v_T(\cdot, \cdot)$ is not available explicitly and the random variable $T$ is straightforward to simulate. We also note that whenever the marginal PMFs of $Y_i$ are explicit, then so is the function $v_T(\cdot, \cdot)$, which is clear from Lemma 3. For example, if $T$ is standard gamma with shape parameter $r$, then we have

$$v_T(y, \lambda) = \frac{\Gamma(r + y)}{\Gamma(r)} \left(\frac{1}{1 + \lambda}\right)^{r+y} = \frac{y!}{\lambda^y}\, P(Y = y), \quad \lambda, y \in \mathbb{R}_+,$$

where $Y$ has a NB distribution with parameters $r$ and $p = 1/(1 + \lambda)$.

Next, we present an alternative expression for the joint probabilities $P(\mathbf{Y} = \mathbf{y})$, which provides a convenient formula for their computation whenever the variable $T$ is difficult to simulate but its PDF is easy to compute.
This representation involves a multinomial random vector $\mathbf{N} = (N_1, \ldots, N_d)^\top$ with parameters $n$ and $\mathbf{p} = (p_1, \ldots, p_d)^\top$, denoted by MUL$(n, \mathbf{p})$, where $n \in \mathbb{N}$ represents the number of trials, the $\{p_i\}$ represent event probabilities that sum up to one, and

$$P(\mathbf{N} = \mathbf{y}) = \frac{n!}{y_1! \cdots y_d!}\, p_1^{y_1} \cdots p_d^{y_d}, \quad \mathbf{y} \in \left\{\mathbf{k} = (k_1, \ldots, k_d)^\top \in \mathbb{N}_0^d : \sum_{i=1}^d k_i = n\right\}. \qquad (48)$$

Lemma 4 In the above setting, the PMF of $\mathbf{Y}$ is given by

$$P(\mathbf{Y} = \mathbf{y}) = P(\mathbf{N} = \mathbf{y})\, E\left[f_{\lambda T}(W)\right], \quad \mathbf{y} = (y_1, \ldots, y_d)^\top \in \mathbb{N}_0^d, \qquad (49)$$

where $\lambda = \sum_{i=1}^d \lambda_i$, $\mathbf{N} \sim$ MUL$(n, \mathbf{p})$ with $n = \sum_{i=1}^d y_i$ and $p_i = \lambda_i/\lambda$, the quantity $f_{\lambda T}$ is the PDF of $\lambda T$, and $W$ has standard gamma distribution with shape parameter $n + 1$.

Proof Proceeding as in the proof of Lemma 3, we obtain

$$P(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^d \frac{\lambda_i^{y_i}}{y_i!} \cdot \frac{n!}{\lambda^n} \cdot \frac{1}{\lambda} \int_{\mathbb{R}_+} \frac{\lambda^{n+1}}{n!}\, t^{(n+1)-1} e^{-\lambda t} f_T(t)\, dt. \qquad (50)$$

Since the integrand is the product of $f_T(t)$ and the density of a gamma random variable $X$ with shape parameter $n + 1$ and rate $\lambda$, we have

$$\frac{1}{\lambda} \int_{\mathbb{R}_+} \frac{\lambda^{n+1}}{n!}\, t^{(n+1)-1} e^{-\lambda t} f_T(t)\, dt = \frac{1}{\lambda}\, E\left[f_T(X)\right] = \frac{1}{\lambda}\, E\left[f_T\left(\frac{W}{\lambda}\right)\right],$$

where $W = \lambda X$ has standard gamma distribution with shape parameter $n + 1$ (and scale 1). To conclude the result, observe that the expression

$$\prod_{i=1}^d \frac{\lambda_i^{y_i}}{y_i!} \cdot \frac{n!}{\lambda^n}$$

in (50) coincides with the multinomial probability (48) with $p_i = \lambda_i/\lambda$, while

$$\frac{1}{\lambda}\, f_T\left(\frac{w}{\lambda}\right) = f_{\lambda T}(w).$$

Remark 10 Note that in the one-dimensional case where $d = 1$, the multinomial probability in (49) reduces to 1, and we obtain

$$P(Y = y) = E\left[f_{\lambda T}(W)\right], \quad y \in \mathbb{N}_0, \qquad (51)$$

where $Y = N(T)$, $\{N(t),\, t \in \mathbb{R}_+\}$ is a Poisson process with rate $\lambda$, the quantity $f_{\lambda T}$ is the PDF of $\lambda T$, the variable $W$ has standard gamma distribution with shape parameter $y + 1$, and $T$ is independent of the Poisson process.

Finally, we present well-known results concerning the mean and the covariance structure of mixed multivariate Poisson distributions of Type I, which are easily derived through standard conditioning arguments.
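Lemma 4 and Remark 10 turn the PMF into an expectation over a gamma variable, which is useful when the PDF of $T$ is known but simulating $T$ is awkward. Below is a minimal one-dimensional sketch (not from the paper; illustrative values, with a gamma $T$ chosen only so an exact NB answer exists for comparison).

```python
import numpy as np
from math import gamma as gamma_fn

# Remark 10: P(Y = y) = E[ f_{lam*T}(W) ] with W ~ Gamma(y + 1, 1).
# Illustrative setting: T ~ Gamma(r, 1), so lam*T is Gamma(r, scale=lam) with
# density f(w) = w^{r-1} exp(-w/lam) / (Gamma(r) * lam^r), and Y ~ NB(r, 1/(1+lam)).
rng = np.random.default_rng(2)
r, lam, y = 2.0, 1.5, 3

W = rng.gamma(shape=y + 1, scale=1.0, size=500_000)
f_lamT = W ** (r - 1) * np.exp(-W / lam) / (gamma_fn(r) * lam ** r)
pmf_mc = f_lamT.mean()

# Exact NB(r, p) PMF with p = 1 / (1 + lam), per Lemma 1:
p = 1.0 / (1.0 + lam)
pmf_exact = gamma_fn(r + y) / (gamma_fn(r) * gamma_fn(y + 1)) * p ** r * (1 - p) ** y
```

The point of the representation is that only the density of $\lambda T$ enters: simulation happens entirely on the easy gamma variable $W$.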
Generally, whenever the mean of $T$ exists then so does the mean of each $Y_i$, and we have $E(Y_i) = \lambda_i E(T)$. Moreover, the variance of each $Y_i$ is finite whenever $T$ has a finite second moment, in which case we have $\mathrm{Var}(Y_i) = \lambda_i E(T) + \lambda_i^2 \mathrm{Var}(T)$. Thus, the variance of $Y_i$ is greater than the mean, and the distribution of $Y_i$ is over-dispersed. Finally, under the latter assumption, the covariance of $Y_i$ and $Y_j$ exists and equals $\mathrm{Cov}(Y_i, Y_j) = \lambda_i \lambda_j \mathrm{Var}(T)$. The result below summarizes these facts.

Lemma 5 In the above setting, the mean vector of $\mathbf{Y}$ exists whenever the mean of $T$ is finite, in which case we have $E(\mathbf{Y}) = \boldsymbol{\lambda}\, E(T)$, where $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_d)^\top$. Moreover, if $T$ has a finite second moment, then the covariance matrix of $\mathbf{Y}$ is well defined and is given by

$$\Sigma_{\mathbf{Y}} = E(T)\, I(\boldsymbol{\lambda}) + \mathrm{Var}(T)\, \boldsymbol{\lambda} \boldsymbol{\lambda}^\top,$$

where $I(\boldsymbol{\lambda})$ is a $d \times d$ diagonal matrix with the $\{\lambda_i\}$ on the main diagonal.

Remark 11 By the above result, if it exists, the correlation coefficient of $Y_i$ and $Y_j$ is given by

$$\rho_{i,j} = \sqrt{\frac{\lambda_i \lambda_j}{\left(\lambda_i + \frac{E(T)}{\mathrm{Var}(T)}\right)\left(\lambda_j + \frac{E(T)}{\mathrm{Var}(T)}\right)}}.$$

The correlation is always positive, and can generally fall anywhere within the boundaries of zero and one.

Mixed multivariate Poisson distributions of type II

Here we provide basic distributional facts about mixed multivariate Poisson distributions of Type II, which are the distributions of

$$\mathbf{Y} = (Y_1, \ldots, Y_d)^\top = (N_1(T_1), \ldots, N_d(T_d))^\top,$$

where the $\{N_i(\cdot)\}$ are independent Poisson processes with rates $\lambda_i$ and $\mathbf{T} = (T_1, \ldots, T_d)^\top$ is a random vector in $\mathbb{R}_+^d$ with the PDF $f_{\mathbf{T}}$, independent of the $\{N_i\}$.

Lemma 6 In the above setting, the PGF of $\mathbf{Y}$ is given by

$$G(\mathbf{s}) = E\left[\prod_{i=1}^d s_i^{Y_i}\right] = \phi_{\mathbf{T}}(I(\boldsymbol{\lambda})(\mathbf{1} - \mathbf{s})), \quad \mathbf{s} = (s_1, \ldots, s_d)^\top \in [0,1]^d, \qquad (52)$$

where $\phi_{\mathbf{T}}$ is the LT of $\mathbf{T}$, $I(\boldsymbol{\lambda})$ is a $d \times d$ diagonal matrix with the $\{\lambda_i\}$ on the main diagonal, and $\mathbf{1}$ is a $d$-dimensional column vector of 1s.
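The Type I correlation in Remark 11 depends on the model only through the rates $\lambda_i$, $\lambda_j$ and the ratio $E(T)/\mathrm{Var}(T)$. The sketch below (not part of the paper; assumed, illustrative parameters with a standard gamma $T$, for which the ratio equals 1) checks the formula against a simulated correlation.

```python
import numpy as np

# Type I correlation check: with T ~ Gamma(r, 1) we have E(T) = Var(T) = r,
# so E(T)/Var(T) = 1 and, by Remark 11 / Lemma 5,
#   rho_ij = sqrt( lam_i*lam_j / ((lam_i + 1) * (lam_j + 1)) ).
rng = np.random.default_rng(3)
r = 4.0
lam_i, lam_j = 2.0, 5.0

ratio = 1.0                           # E(T) / Var(T) for the standard gamma
rho_theory = np.sqrt(lam_i * lam_j / ((lam_i + ratio) * (lam_j + ratio)))

T = rng.gamma(shape=r, scale=1.0, size=300_000)
Yi = rng.poisson(lam_i * T)           # the shared T induces positive dependence
Yj = rng.poisson(lam_j * T)
rho_emp = np.corrcoef(Yi, Yj)[0, 1]
```

Note how the attainable correlation is capped below one: even with a shared rate variable, the Poisson layer adds independent noise to each coordinate.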
Proof By using a standard conditioning argument, we have

$$G(\mathbf{s}) = E\left[\prod_{i=1}^d s_i^{Y_i}\right] = \int_{\mathbb{R}_+^d} E\left[\prod_{i=1}^d s_i^{Y_i} \,\Big|\, \mathbf{T} = \mathbf{t}\right] dF_{\mathbf{T}}(\mathbf{t}). \qquad (53)$$

Since, given $\mathbf{T} = \mathbf{t}$, the variables $\{Y_i\}$ are independent and Poisson distributed with means $\{\lambda_i t_i\}$, respectively, we have

$$E\left[\prod_{i=1}^d s_i^{Y_i} \,\Big|\, \mathbf{T} = \mathbf{t}\right] = \prod_{i=1}^d E\left[s_i^{Y_i} \mid \mathbf{T} = \mathbf{t}\right] = \prod_{j=1}^d e^{-\lambda_j t_j(1-s_j)} = e^{-\mathbf{t}^\top I(\boldsymbol{\lambda})(\mathbf{1}-\mathbf{s})}.$$

When we substitute the above into (53) we conclude that the PGF of $\mathbf{Y}$ is as stated in the lemma.

Remark 12 Note that the expression (52) is a generalization of (42) to the multivariate case of mixed Poisson. Additionally, observe that if the components of $\mathbf{T}$ coincide, that is $T_i = T$ for $i = 1, \ldots, d$, we have

$$\phi_{\mathbf{T}}(\mathbf{t}) = E\left[e^{-\mathbf{t}^\top \mathbf{T}}\right] = E\left[e^{-(t_1 + \cdots + t_d)T}\right] = \phi_T(t_1 + \cdots + t_d),$$

and the PGF in (52) reduces to its counterpart provided in Lemma 2, as it should.

Remark 13 Let us note that changing the scaling factors of the variables $T_i$ in this model has the same effect as adjusting the rate parameters connected with the Poisson processes $\{N_i(\cdot)\}$. Namely, it follows from Lemma 6 that if we let $\tilde{T}_i = c_i T_i$ in the above setting, then we have the following equality in distribution:

$$\left(N_1(\tilde{T}_1), \ldots, N_d(\tilde{T}_d)\right)^\top \stackrel{d}{=} \left(\tilde{N}_1(T_1), \ldots, \tilde{N}_d(T_d)\right)^\top, \qquad (54)$$

where the $\{\tilde{N}_i(\cdot)\}$ are independent Poisson processes with rates $c_i \lambda_i$, respectively. Thus, without loss of generality, we may assume that the scale parameters of the variables $T_i$ in this model are set to unity.

Next, we provide a convenient formula for the PMF of multivariate mixed Poisson distributions of Type II, which is an extension of that given in Lemma 3. To state the result, we extend the definition of the function $v$ described by (12) to vector-valued arguments and random vectors $\mathbf{T}$ in $\mathbb{R}_+^d$. Namely, for $\mathbf{a} = (a_1, \ldots, a_d)^\top,\, \mathbf{b} = (b_1, \ldots, b_d)^\top \in \mathbb{R}_+^d$ we set

$$\mathbf{a}^{\mathbf{b}} = \prod_{i=1}^d a_i^{b_i} \qquad (55)$$

and define

$$v_{\mathbf{T}}(\mathbf{y}, \boldsymbol{\lambda}) = E\left[\mathbf{T}^{\mathbf{y}}\, e^{-\boldsymbol{\lambda}^\top \mathbf{T}}\right], \quad \boldsymbol{\lambda}, \mathbf{y} \in \mathbb{R}_+^d. \qquad (56)$$

Lemma 7 In the above setting, the PMF of $\mathbf{Y}$ is given by

$$P(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^d g_i(y_i)\, h(\mathbf{y}), \quad \mathbf{y} = (y_1, \ldots, y_d)^\top \in \mathbb{N}_0^d,$$

where

$$g_i(y) = \frac{\lambda_i^y}{y!}\, v_{T_i}(y, \lambda_i), \quad y \in \mathbb{N}_0,$$

are the marginal PMFs of the $\{Y_i\}$ and the function $h$ is given by

$$h(\mathbf{y}) = \frac{v_{\mathbf{T}}(\mathbf{y}, \boldsymbol{\lambda})}{\prod_{i=1}^d v_{T_i}(y_i, \lambda_i)}, \quad \mathbf{y} = (y_1, \ldots, y_d)^\top \in \mathbb{N}_0^d.$$

Proof By using a standard conditioning argument, we have

$$P(\mathbf{Y} = \mathbf{y}) = \int_{\mathbb{R}_+^d} P(N_1(T_1) = y_1, \ldots, N_d(T_d) = y_d \mid \mathbf{T} = \mathbf{t})\, f_{\mathbf{T}}(\mathbf{t})\, d\mathbf{t}, \qquad (57)$$

where $\mathbf{y} = (y_1, \ldots, y_d)^\top$ and $\mathbf{t} = (t_1, \ldots, t_d)^\top$. Further, by independence, we have

$$P(N_1(T_1) = y_1, \ldots, N_d(T_d) = y_d \mid \mathbf{T} = \mathbf{t}) = \prod_{i=1}^d P(N_i(t_i) = y_i). \qquad (58)$$

Since the $N_i(t_i)$ are Poisson with parameters $\lambda_i t_i$, we have

$$P(N_i(t_i) = y_i) = \frac{e^{-\lambda_i t_i} (\lambda_i t_i)^{y_i}}{y_i!}, \quad i = 1, \ldots, d. \qquad (59)$$

When we now substitute (58)-(59) into (57), then after some algebra we get

$$P(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^d \frac{\lambda_i^{y_i}}{y_i!} \int_{\mathbb{R}_+^d} e^{-\sum_{i=1}^d \lambda_i t_i} \prod_{i=1}^d t_i^{y_i}\, f_{\mathbf{T}}(\mathbf{t})\, d\mathbf{t} = \prod_{i=1}^d \frac{\lambda_i^{y_i}}{y_i!}\; v_{\mathbf{T}}(\mathbf{y}, \boldsymbol{\lambda}). \qquad (60)$$

Similarly, the marginal PMFs are given by

$$P(Y_i = y) = \int_{\mathbb{R}_+} e^{-t\lambda_i}\, \frac{(\lambda_i t)^y}{y!}\, dF_{T_i}(t) = \frac{\lambda_i^y}{y!}\, v_{T_i}(y, \lambda_i). \qquad (61)$$

By combining (60) and (61), we obtain the result.

We now present an alternative expression for the joint probabilities $P(\mathbf{Y} = \mathbf{y})$, which facilitates their computation if the random vector $\mathbf{T}$ is difficult to simulate but its PDF is readily available.

Lemma 8 In the above setting, the PMF of $\mathbf{Y}$ is given by

$$P(\mathbf{Y} = \mathbf{y}) = E\left[f_{I(\boldsymbol{\lambda})\mathbf{T}}(\mathbf{W})\right], \quad \mathbf{y} = (y_1, \ldots, y_d)^\top \in \mathbb{N}_0^d, \qquad (62)$$

where the quantity $f_{I(\boldsymbol{\lambda})\mathbf{T}}$ is the PDF of $I(\boldsymbol{\lambda})\mathbf{T} = (\lambda_1 T_1, \ldots, \lambda_d T_d)^\top$ and $\mathbf{W} = (W_1, \ldots, W_d)^\top$ with mutually independent $W_i$ having standard gamma distributions with shape parameters $y_i + 1$.

Proof Proceeding as in the proof of Lemma 4, we obtain

$$P(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^d \frac{1}{\lambda_i} \int_{\mathbb{R}_+^d} \prod_{i=1}^d \frac{\lambda_i^{y_i+1}}{y_i!}\, t_i^{(y_i+1)-1} e^{-\lambda_i t_i}\, f_{\mathbf{T}}(\mathbf{t})\, d\mathbf{t}. \qquad (63)$$

Note that the product under the integral above is the PDF of $\mathbf{X} = (X_1, \ldots, X_d)^\top$, where the $X_i$ are mutually independent gamma random variables with shape parameters $y_i + 1$ and rate parameters $\lambda_i$. This allows us to conclude that

$$P(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^d \frac{1}{\lambda_i}\, E\left[f_{\mathbf{T}}(\mathbf{X})\right] = \prod_{i=1}^d \frac{1}{\lambda_i}\, E\left[f_{\mathbf{T}}\left(\frac{W_1}{\lambda_1}, \ldots, \frac{W_d}{\lambda_d}\right)\right],$$

where $\mathbf{W} = (W_1, \ldots, W_d)^\top = I(\boldsymbol{\lambda})\mathbf{X}$ has independent standard gamma components with shape parameters $y_i + 1$. To conclude the result, observe that

$$\prod_{i=1}^d \frac{1}{\lambda_i}\, f_{\mathbf{T}}\left(\frac{w_1}{\lambda_1}, \ldots, \frac{w_d}{\lambda_d}\right) = f_{I(\boldsymbol{\lambda})\mathbf{T}}(\mathbf{w}).$$

Finally, let us summarize standard results concerning the mean and the covariance structure of mixed multivariate Poisson distributions of Type II, which parallel the results for Type I and are easily derived through standard conditioning arguments. Generally, whenever the means of the $\{T_i\}$ exist then so do the means of the $\{Y_i\}$, and we have $E(Y_i) = \lambda_i E(T_i)$. Similarly, the variance of each $Y_i$ is finite whenever $T_i$ has a finite second moment, in which case we have $\mathrm{Var}(Y_i) = \lambda_i E(T_i) + \lambda_i^2 \mathrm{Var}(T_i)$. Again, the distribution of $Y_i$ is always over-dispersed. Finally, for any $i \neq j$, the covariance of $Y_i$ and $Y_j$ exists and equals $\mathrm{Cov}(Y_i, Y_j) = \lambda_i \lambda_j \mathrm{Cov}(T_i, T_j)$ whenever the covariance of $T_i$ and $T_j$ exists. These facts are summarized in the result below.

Lemma 9 In the above setting, the mean vector of $\mathbf{Y}$ exists whenever the mean of $\mathbf{T}$ is finite, in which case we have $E(\mathbf{Y}) = I(\boldsymbol{\lambda})\, E(\mathbf{T})$, where $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_d)^\top$ and $I(\boldsymbol{\lambda})$ is a $d \times d$ diagonal matrix with the $\{\lambda_i\}$ on the main diagonal. Moreover, if $\mathbf{T}$ has a finite covariance matrix $\Sigma_{\mathbf{T}}$ then the covariance matrix of $\mathbf{Y}$ is well defined as well and is given by

$$\Sigma_{\mathbf{Y}} = I(\boldsymbol{\lambda})\, I(E(\mathbf{T})) + I(\boldsymbol{\lambda})\, \Sigma_{\mathbf{T}}\, I(\boldsymbol{\lambda}),$$

where $I(E(\mathbf{T}))$ is a $d \times d$ diagonal matrix with the diagonal entries $\{E(T_i)\}$.
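To see Lemma 9 in action, the sketch below simulates a two-dimensional Type II model and compares empirical moments with the stated formulas. It is not from the paper: the dependent gamma rates are built here from a shared gamma shock, a simple stand-in for the copula-linked rates used elsewhere in the paper, and all parameter values are illustrative.

```python
import numpy as np

# Type II mixed Poisson: Y_i = N_i(T_i) with dependent rates T_i.
# Shared-shock construction: T_1 = G0 + G1, T_2 = G0 + G2 with independent
# gammas (unit scale), so Cov(T_1, T_2) = Var(G0) = a0 and both margins
# remain gamma.
rng = np.random.default_rng(4)
n = 400_000
a0, a1, a2 = 1.0, 1.0, 2.0            # illustrative gamma shape parameters
lam = np.array([2.0, 3.0])            # illustrative Poisson rates lambda_i

G0 = rng.gamma(a0, 1.0, size=n)
T = np.column_stack([G0 + rng.gamma(a1, 1.0, size=n),
                     G0 + rng.gamma(a2, 1.0, size=n)])
Y = rng.poisson(lam * T)

# Lemma 9: E(Y) = I(lam) E(T) and Cov(Y_1, Y_2) = lam_1 * lam_2 * Cov(T_1, T_2).
mean_theory = lam * np.array([a0 + a1, a0 + a2])
cov_theory = lam[0] * lam[1] * a0
mean_emp = Y.mean(axis=0)
cov_emp = np.cov(Y[:, 0], Y[:, 1])[0, 1]
```

Unlike Type I, the cross-covariance here is driven by $\mathrm{Cov}(T_i, T_j)$ rather than a single $\mathrm{Var}(T)$, so the dependence structure of the rate vector carries over directly to the counts.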
Abbreviations
BRCA: Breast invasive carcinoma; CDF: Cumulative distribution function; FGM: Farlie-Gumbel-Morgenstern; L-P model: lognormal-Poisson model; mRNA: messenger ribonucleic acid; NB: Negative binomial; NORTA: NORmal To Anything; PDF: Probability density function; PGF: Probability generating function; PMF: Probability mass function; RNA-seq: RNA-sequencing

Acknowledgements
The authors thank the two reviewers for their comments that helped improve the paper. We also thank Professors Walter W. Piegorsch and Edward J. Bedrick (University of Arizona) for their helpful discussions.

Authors' contributions
AGS and TJK conceived the study. TJK, AKP, and AGS developed the approach. ADK and AGS conducted the computational analyses. TJK, AKP, and AGS wrote the manuscript. TJK, AKP, AGS, and ADK revised the manuscript. All authors read and approved the final document.

Funding
Research reported in this publication was supported by MW-CTR-IN of the National Institutes of Health under award number 1U54GM104944.

Availability of data and materials
The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. Code reproducing the BRCA data set and computational analyses is available from the corresponding author on reasonable request.

Declarations

Competing interests
The authors declare that they have no competing interests.

Received: 28 October 2020 Accepted: 2 March 2021

References
Barbiero, A., Ferrari, P. A.: An R package for the simulation of correlated discrete variables. Comm. Statist. Simul. Comput. 46(7), 5123–5140 (2017)
Chen, H.: Initialization for NORTA: Generation of random vectors with specified marginals and correlations. INFORMS J. Comput. 13(4), 257–360 (2001)
Clemen, R. T., Reilly, T.: Correlations and copulas for decision and risk analysis. Manag. Sci. 45, 208–224 (1999)
Demitras, H., Hedeker, D.: A practical way for computing approximate lower and upper correlation bounds. Amer. Statist. 65(2), 104–109 (2011)
Johnson, N., Kotz, S., Balakrishnan, N.: Discrete Multivariate Distributions. Wiley, New York (1997)
Karlis, D., Xekalaki, E.: Mixed Poisson distributions. Intern. Statist. Rev. 73(1), 35–58 (2005)
Kozubowski, T. J., Podgórski, P.: Distribution properties of the negative binomial Lévy process. Probab. Math. Statist. 29, 43–71 (2009)
Madsen, L., Birkes, D.: Simulating dependent discrete data. J. Stat. Comput. Simul. 83(4), 677–691 (2013)
Madsen, L., Dalthorp, D.: Simulating correlated count data. Environ. Ecol. Stat. 14(2), 129–148 (2007)
Nelsen, R. B.: An Introduction to Copulas. Springer, New York (2006)
Nikoloulopoulos, A. K.: Copula-based models for multivariate discrete response data. In: Copulae in Mathematical and Quantitative Finance, 231–249, Lect. Notes Stat., 213. Springer, Heidelberg (2013)
Nikoloulopoulos, A. K., Karlis, D.: Modeling multivariate count data using copulas. Comm. Statist. Sim. Comput. 39(1), 172–187 (2009)
Schissler, A. G., Piegorsch, W. W., Lussier, Y. A.: Testing for differentially expressed genetic pathways with single-subject N-of-1 data in the presence of inter-gene correlation. Stat. Methods Med. Res. 27(12), 3797–3813 (2018)
Solomon, D. L.: The spatial distribution of cabbage butterfly eggs. In: Roberts, H., Thompson, M. (eds.) Life Science Models Vol. 4, pp. 350–366. Springer-Verlag, New York (1983)
Song, W. T., Hsiao, L.-C.: Generation of autocorrelated random variables with a specified marginal distribution. In: Proceedings of the 1993 Winter Simulation Conference (WSC '93), pp. 374–377, Los Angeles (1993). https://doi.org/10.1109/WSC.1993.718074
Xiao, Q.: Generating correlated random vector involving discrete variables. Comm. Statist. Theory Methods. 46(4), 1594–1605 (2017)
Xiao, Q., Zhou, S.: Matching a correlation coefficient by a Gaussian copula. Comm. Statist. Theory Methods. 48(7), 1728–1747 (2019)

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Journal of Statistical Distributions and Applications, Springer Journals. Published: Mar 16, 2021