Generalized fiducial inference on the mean of zero-inflated Poisson and Poisson hurdle models

Yixuan Zou; Jan Hannig; Derek S. Young

doi:10.1186/s40488-021-00117-0

Zou, Yixuan; Hannig, Jan; Young, Derek S.

2021-03-06 00:00:00

Correspondence: jan.hannig@unc.edu. Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA. Full list of author information is available at the end of the article.

Abstract

Zero-inflated and hurdle models are widely applied to count data possessing excess zeros, where they can simultaneously model the process by which the zeros were generated and potentially help mitigate the effects of overdispersion relative to the assumed count distribution. Which model to use depends on how the zeros are generated: zero-inflated models add an additional probability mass on zero, while hurdle models are two-part models comprised of a degenerate distribution for the zeros and a zero-truncated distribution. Developing confidence intervals for such models is challenging since no closed-form function is available to calculate the mean. In this study, generalized fiducial inference is used to construct confidence intervals for the means of zero-inflated Poisson and Poisson hurdle models. The proposed methods are assessed by an intensive simulation study. An illustrative example demonstrates the inference methods.

Keywords: Count data, Coverage probability, Data dispersion, Generalized confidence intervals, Zero-truncated Poisson

1 Introduction

The Poisson distribution is arguably one of the most commonly used models for count data. As such, a large number of inferential tools are available for Poisson-based models, such as for the ratio of two Poisson rates (Gu et al. 2008), Poisson regression models (Cameron and Trivedi 1990), and Poisson point processes (Itô 2015). Assuming the Poisson as an underlying distribution for parametric modeling can be a fairly strong assumption, since one must be willing to posit that the data are equi-dispersed, that is, that the variance equals the mean.
In practice, count data almost ubiquitously demonstrate over-dispersion, which can be attributed to, for example, (spatio-)temporal dependency, unexplained heterogeneity, and/or excess zeros (Cameron and Trivedi 2013).

© The Author(s). 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. Zou et al. Journal of Statistical Distributions and Applications (2021) 8:5

One of the earliest papers to address the problem of excess zeros was Mullahy (1986), who proposed a two-part model that permits a more flexible data-generating process: zeros are from a binomial distribution while positive values are from a truncated distribution. Such a model can accommodate under- and over-dispersion. The model using a zero-truncated Poisson is often called the Poisson hurdle (PH) model. Later, the seminal paper of Lambert (1992) extended this phenomenon of excess zeros to the count regression setting, but also framed the problem differently with respect to how the zeros were generated.
Specifically, a certain number of zeros are expected to be generated according to the assumed count distribution (random zeros), while the excess zeros are assumed to be generated from a separate, degenerate process (structural zeros). This framework results in a zero-inflated model, which is a two-component mixture model with one component for the assumed count distribution and a second component that is a degenerate distribution at zero. Lambert (1992) developed this framework in the context of zero-inflated Poisson (ZIP) regression models. Both PH and ZIP models accommodate the notion of excess zeros in a Poisson setting, but the two models treat the generation of the zeros differently. Moreover, both models tend to have comparable performance regarding goodness-of-fit measures, which underscores that the application should guide the choice of how the zeros are assumed to be generated.

The more complex data setting posed by zero-inflation opens the door to additional inference considerations, many coupled with their own challenges. For example, a bevy of score tests has been developed for testing the presence of zero-inflation in various count data settings; cf. van den Broek (1995); Janaskul and Hinde (2002); Janaskul and Hinde (2008); Cao et al. (2014); Todem et al. (2018). Bhattacharya et al. (2008) used a general Bayesian setup for detecting whether zero-inflation is present in the data; however, it is challenging to justify the selection of the prior distribution. Score-based tests are also available for testing the presence of overdispersion, which can be caused by zero-inflation; cf. Ridout et al. (2001); Hall and Berenhaut (2002); Deng and Paul (2005).
With the exception of large-sample-based approaches for constructing confidence intervals on regression parameters in zero-inflated regression and hurdle regression models, there is no panacea for constructing reliable, accurate confidence intervals for other parameters in their non-regression counterparts, such as the population mean of univariate ZIP and PH distributions.

Deriving confidence intervals for a more complex data setting, like the presence of excess zeros, is challenging in the frequentist setting. Typically, one resorts to normal-based theory, but finite-sample properties can be highly unreliable. In particular, Waguespack et al. (2020) assessed Wald-based confidence intervals for the ZIP mean. Their simulation results showed that these intervals become more liberal as n decreases. As an alternative, they proposed constructing a bootstrap-based confidence interval for the ZIP mean, which had coverage probabilities much closer to the nominal level. They also conducted a signed likelihood ratio test (SLRT) for testing the ZIP mean, which controlled the type I error rate satisfactorily. Bayesian approaches face the challenge of justifying the selection of the prior distribution, as noted earlier for the work of Bhattacharya et al. (2008).

Alternatively, one can consider fiducial inference as proposed by Fisher (1935). Fiducial inference struggled to gain popularity among statisticians because of perceived deficiencies in the general approach. However, later works have developed more sophisticated procedures coupled with rigorous theory to mitigate such criticisms, all while reflecting the core tenets of the fiducial paradigm. For example, Weerahandi (1995) introduced generalized confidence intervals (GCIs) constructed by generalized pivotal quantities (GPQs), and Hannig et al. (2006) further established the connection between GCI and the fiducial argument of Fisher. For the purposes of our study, we turn to generalized fiducial distributions, as they often lead to attractive solutions with asymptotically correct frequentist coverage levels. Moreover, many simulation studies have shown that generalized fiducial solutions have very good small-sample properties; see, for example, Hannig (2009) and E et al. (2008). There is also some work on fiducial approaches for discrete distributions. For example, Mathew and Young (2013) developed fiducial tolerance intervals for functions of discrete random variables, while Hannig et al. (2016) presented an extensive summary about computing the generalized fiducial distribution for parameters of some common discrete distributions. In this paper, we consider fiducial inference for the mean of ZIP and PH distributions.

This paper is organized as follows. In Section 2, we give a brief sketch of generalized fiducial inference, with emphasis on the discrete data setting. In Section 3, we derive the respective fiducial distributions of the ZIP mean and PH mean. In Section 4, we present a numerical study to illustrate the good coverage probabilities of GCIs for the ZIP mean and PH mean constructed using fiducial inference, and demonstrate that the fiducial test has comparable performance to the SLRT when testing the ZIP mean. An analysis of urinary tract infection data is presented in Section 5. In Section 6, we make some concluding remarks.

2 Generalized fiducial inference

The aim of generalized fiducial inference is to define a distribution for the parameters of interest that contains all of the information from the data. Inference on the parameters can then be made through this distribution. The tenet of generalized fiducial inference is to switch the roles of the parameters and the data. We now briefly explain the philosophy of generalized fiducial inference.
Suppose that data $Y$ are generated through the structural equation

$$Y = G(\xi, U),$$

where $\xi$ is a vector of parameters and $U$ is a random variable whose distribution is known and does not depend on $\xi$. The structural equation can be regarded as a data generation process in which the noise $U$ and the signal $\xi$ produce the observed data $Y$. Hence, the distribution of $Y$ is determined by the structural equation, given a fixed parameter $\xi$ and the distribution of $U$. After the data $Y$ are observed, we can switch the roles of the data and the parameters by solving the structural equation, conditional on a solution to that equation existing. Thus, we obtain $\xi = Q(Y, U)$. For more details regarding this setup, we refer to Hannig (2009).

2.1 Generalized fiducial inference on discrete data

Let $Y$ now be a discrete random variable with distribution function $F(\cdot \mid \theta)$. If $U \sim U(0, 1)$, data following the distribution $F(\cdot \mid \theta)$ can be generated through $Y = F^{-}(U \mid \theta)$, where $F^{-}(a \mid \theta) = \inf\{y : a \le F(y \mid \theta)\}$ is the inverse function. According to the philosophy of generalized fiducial inference, we need to solve the data generating equation to express the parameter as a function of the data and a random variable with a known distribution. Assume that, for each fixed $y$, the distribution function $F(y \mid \theta)$ is a nonincreasing function of $\theta$. It follows that

$$Q_y^{+}(u) = \sup\{\theta : F(y \mid \theta) = u\} \quad \text{and} \quad Q_y^{-}(u) = \inf\{\theta : F(y^{-} \mid \theta) = u\}$$

exist and satisfy $F(y \mid Q_y^{+}(u)) = F(y^{-} \mid Q_y^{-}(u)) = u$, where $F(y^{-} \mid \theta)$ is the left limit of the distribution function at $y$. Moreover, the closure of the inverse image is $Q_y(u) = [Q_y^{-}(u), Q_y^{+}(u)]$. Hannig et al. (2016) chose a 50-50 mixture of the upper and lower bounds as the generalized fiducial distribution for the parameter, so that each fiducial draw is taken from the upper bound or the lower bound with probability 50% each.
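To make the inversion in Section 2.1 concrete, the following is a minimal sketch for a single Poisson count $y$ with mean $\theta$, a case chosen here only for illustration. It solves $F(y \mid \theta) = u$ and $F(y - 1 \mid \theta) = u$ by bisection to obtain the upper and lower bounds, and then draws from their 50-50 mixture. The function names, the bisection tolerance, and the percentile rule are assumptions of this sketch rather than part of the paper.

```python
import math
import random


def poisson_cdf(y, theta):
    """Return F(y | theta) = P(Y <= y) for Y ~ Poisson(theta); 0 when y < 0."""
    if y < 0:
        return 0.0
    term = math.exp(-theta)
    total = term
    for i in range(1, y + 1):
        term *= theta / i
        total += term
    return min(total, 1.0)


def solve_theta(y, u, tol=1e-10):
    """Solve F(y | theta) = u for theta by bisection (F is decreasing in theta)."""
    lo, hi = 0.0, float(y + 10)
    while poisson_cdf(y, hi) > u:       # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if poisson_cdf(y, mid) > u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fiducial_draw(y, rng):
    """One draw from the 50-50 mixture of the lower and upper fiducial bounds."""
    u = rng.random()
    lower = solve_theta(y - 1, u) if y > 0 else 0.0   # uses the left limit F(y^-)
    upper = solve_theta(y, u)
    return lower if rng.random() < 0.5 else upper


def fiducial_interval(y, n_draws=2000, alpha=0.05, seed=1):
    """Equal-tailed (1 - alpha) interval from sorted fiducial draws."""
    rng = random.Random(seed)
    draws = sorted(fiducial_draw(y, rng) for _ in range(n_draws))
    lo_idx = int(math.floor(alpha / 2 * (n_draws - 1)))
    hi_idx = int(math.ceil((1 - alpha / 2) * (n_draws - 1)))
    return draws[lo_idx], draws[hi_idx]


if __name__ == "__main__":
    print(fiducial_interval(10))
```

Because $F(y - 1 \mid \theta) < F(y \mid \theta)$ and both are decreasing in $\theta$, the lower solution never exceeds the upper one, which is what makes the interval $[Q_y^{-}(u), Q_y^{+}(u)]$ well defined.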
For the fiducial distributions of multiple parameters, a two-stage method can be applied based on the minimal sufficient statistics. Let the parameters of interest be $\xi = (\xi_1, \xi_2)$. Assume that the following two conditions hold:

1. If $\xi_2$ is known, there is a statistic $S_1 = S_1(\xi_2)$ that has an invertible pivotal relationship with $\xi_1$.
2. A statistic $S_2$ exists such that $S_2$ and $\xi_2$ have an invertible pivotal relationship.

Then we can obtain the fiducial distribution of $\xi_2$, followed by the fiducial distribution of $\xi_1$ given that $\xi_2$ is known.

3 Fiducial distributions for Poisson data with excess zeros

3.1 Fiducial distribution of ZIP mean

The ZIP distribution has probability mass function

$$p(x \mid \pi, \lambda) = \pi I_{\{0\}}(x) + (1 - \pi)\frac{\lambda^{x} e^{-\lambda}}{x!} I_{\mathbb{N}}(x),$$

where $I_{A}(z)$ is the indicator function that $z$ belongs to the set $A$. The following proposition establishes the minimal sufficient statistic for a ZIP distribution:

Proposition 3.1 Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from a ZIP distribution. Denote the sum of the random sample as $S$ and the number of zeros of the random sample as $K$, where $S = \sum_{i=1}^{n} X_i$ and $K = \sum_{i=1}^{n} I_{\{0\}}(X_i)$. Then the minimal sufficient statistic is $(S, K)$.

Proof First we need to prove that $(S, K)$ is sufficient. Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ be the realizations of $X$. The joint density of $(X_1, \ldots, X_n)$ is

$$
\begin{aligned}
p(\mathbf{x} \mid \pi, \lambda)
&= \prod_{i=1}^{n}\left[\pi I_{\{0\}}(x_i) + (1 - \pi)\frac{\lambda^{x_i} e^{-\lambda}}{x_i!} I_{\mathbb{N}}(x_i)\right] \\
&= \prod_{i=1}^{n}\left\{\left[\pi + (1 - \pi)e^{-\lambda}\right] I_{\{0\}}(x_i) + (1 - \pi)\frac{\lambda^{x_i} e^{-\lambda}}{x_i!} I_{\mathbb{N}^{+}}(x_i)\right\} \\
&= \prod_{i=1}^{n}\left[\pi + (1 - \pi)e^{-\lambda}\right]^{I_{\{0\}}(x_i)} \left[(1 - \pi)\frac{\lambda^{x_i} e^{-\lambda}}{x_i!}\right]^{1 - I_{\{0\}}(x_i)} \\
&= \left[\frac{\pi + (1 - \pi)e^{-\lambda}}{(1 - \pi)e^{-\lambda}}\right]^{\sum_{i=1}^{n} I_{\{0\}}(x_i)} \left[(1 - \pi)e^{-\lambda}\right]^{n} \lambda^{\sum_{i=1}^{n} x_i (1 - I_{\{0\}}(x_i))} \prod_{i=1}^{n}\left(\frac{1}{x_i!}\right)^{1 - I_{\{0\}}(x_i)} \\
&= \left[\frac{\pi + (1 - \pi)e^{-\lambda}}{(1 - \pi)e^{-\lambda}}\right]^{\sum_{i=1}^{n} I_{\{0\}}(x_i)} \left[(1 - \pi)e^{-\lambda}\right]^{n} \lambda^{\sum_{i=1}^{n} x_i} \times \prod_{i=1}^{n}\left(\frac{1}{x_i!}\right)^{1 - I_{\{0\}}(x_i)},
\end{aligned}
$$

where $\mathbb{N}^{+} = \mathbb{N} \setminus \{0\}$; the last step uses $x_i (1 - I_{\{0\}}(x_i)) = x_i$. According to the factorization theorem, $(S, K)$ is sufficient.

Now we want to show that $(S, K)$ is minimal sufficient. Assume that we have another sample $Y = (Y_1, \ldots, Y_n)^{T}$ with corresponding realizations $\mathbf{y} = (y_1, \ldots, y_n)^{T}$. The ratio of the two density functions is

$$\frac{p(\mathbf{x} \mid \pi, \lambda)}{p(\mathbf{y} \mid \pi, \lambda)} = \left[\frac{\pi + (1 - \pi)e^{-\lambda}}{(1 - \pi)e^{-\lambda}}\right]^{\sum_{i=1}^{n} I_{\{0\}}(x_i) - \sum_{i=1}^{n} I_{\{0\}}(y_i)} \lambda^{\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} y_i} f(\mathbf{x}, \mathbf{y}),$$

where $f$ is a function that does not depend on the parameters. The ratio is free of $(\pi, \lambda)$ if and only if $\left(\sum_{i=1}^{n} x_i, \sum_{i=1}^{n} I_{\{0\}}(x_i)\right) = \left(\sum_{i=1}^{n} y_i, \sum_{i=1}^{n} I_{\{0\}}(y_i)\right)$. Hence $(S, K)$ is minimal sufficient.

It immediately follows that $K \sim \text{Binomial}(n, \pi + (1 - \pi)e^{-\lambda})$, and $(S \mid K = k)$ has the same distribution as $\sum_{i=1}^{n-k} Y_i$, where the $Y_i$ are independent Poisson($\lambda$) random variables conditioned on the event $\{Y_i \ge 1\}$. We also need the following proposition regarding sums of zero-truncated Poisson distributions:

Proposition 3.2 Let $Y_1, Y_2, \ldots, Y_m$ be independent Poisson($\lambda$) random variables conditioned on the event $\{Y_j \ge 1\}$. Then

$$P\left(\sum_{j=1}^{m} Y_j = k\right) = \frac{\lambda^{k}}{k!\,(e^{\lambda} - 1)^{m}} \sum_{j=0}^{m} (-1)^{m-j}\binom{m}{j} j^{k} = \frac{\lambda^{k}\, m!\, S(k, m)}{k!\,(e^{\lambda} - 1)^{m}}\, I_{\{m, m+1, \ldots\}}(k),$$

where $S(k, m) = \frac{1}{m!}\sum_{j=0}^{m} (-1)^{m-j}\binom{m}{j} j^{k}$ is the Stirling number of the second kind.

Proof The proof follows by mathematical induction, and can be found in Springael and Van Nieuwenhuyse (2006).

Denote the distribution function of the sum of $m$ zero-truncated Poisson($\lambda$) random variables by $F_1(k \mid m, \lambda)$, the distribution function of a Poisson random variable with mean parameter $\lambda$ by $F_P(k \mid \lambda)$, and the distribution function of a Binomial($n, p$) random variable by $F_B(k \mid n, p)$. Summing the probability mass function in Proposition 3.2 over the values $0, 1, \ldots, k$ gives

$$F_1(k \mid m, \lambda) = P\left(\sum_{j=1}^{m} Y_j \le k\right) = \sum_{j=0}^{m} (-1)^{m-j}\binom{m}{j}\frac{e^{\lambda j}}{(e^{\lambda} - 1)^{m}} F_P(k \mid \lambda j),$$

where the $j = 0$ term is read with the convention $F_P(k \mid 0) = 1$, since a Poisson random variable with mean zero is degenerate at zero.

We will use the inverse distribution functions as a data generating equation:

$$K = F_B^{-1}\left(U_1 \mid n, \pi + (1 - \pi)e^{-\lambda}\right) \quad \text{and} \quad S = F_1^{-1}(U_2 \mid n - K, \lambda),$$

where $U_1, U_2$ are independent $U(0, 1)$. When $K = n$, the value of $S$ is set as 0.
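The closed form in Proposition 3.2 is easy to check numerically. The sketch below evaluates the probability mass function of a sum of $m$ zero-truncated Poisson($\lambda$) variables through exact integer Stirling numbers, and compares it with a brute-force convolution of the single-variable distribution. The function names are hypothetical, and computing the distribution function by summing the mass function is a choice made here for numerical stability, not something prescribed by the paper.

```python
import math


def stirling2(k, m):
    """Stirling number of the second kind S(k, m), via the standard recurrence."""
    row = [1] + [0] * m                      # row[j] holds S(0, j)
    for _ in range(k):
        # S(i, j) = j * S(i-1, j) + S(i-1, j-1), and S(i, 0) = 0 for i >= 1
        row = [0] + [j * row[j] + row[j - 1] for j in range(1, m + 1)]
    return row[m]


def ztp_pmf(y, lam):
    """pmf of a single zero-truncated Poisson(lam) variable."""
    if y < 1:
        return 0.0
    return math.exp(y * math.log(lam) - math.lgamma(y + 1)) / math.expm1(lam)


def ztp_sum_pmf(k, m, lam):
    """P(Y_1 + ... + Y_m = k) from the closed form in Proposition 3.2 (m >= 1)."""
    if k < m:
        return 0.0
    log_p = (math.log(stirling2(k, m)) + math.lgamma(m + 1) - math.lgamma(k + 1)
             + k * math.log(lam) - m * math.log(math.expm1(lam)))
    return math.exp(log_p)


def ztp_sum_cdf(k, m, lam):
    """F_1(k | m, lam), obtained here by summing the pmf over m, ..., k."""
    return sum(ztp_sum_pmf(i, m, lam) for i in range(m, k + 1))


def ztp_sum_pmf_by_convolution(m, lam, kmax):
    """Brute-force check: convolve the single-variable pmf m times."""
    base = [ztp_pmf(y, lam) for y in range(kmax + 1)]
    dist = [1.0] + [0.0] * kmax              # distribution of an empty sum
    for _ in range(m):
        dist = [sum(dist[i] * base[t - i] for i in range(t + 1))
                for t in range(kmax + 1)]
    return dist


if __name__ == "__main__":
    conv = ztp_sum_pmf_by_convolution(3, 1.7, 10)
    for k in range(3, 11):
        print(k, ztp_sum_pmf(k, 3, 1.7), conv[k])
```

For fixed $k$ and $m$, `ztp_sum_cdf` decreases as `lam` grows, which is the monotonicity that the inversion of the data generating equation relies on.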
After observing $k$ and $s$, and inverting the data generating equation, we see that

$$B_{k, n-k+1}(U_1^{*}) \le \pi + (1 - \pi)e^{-\lambda} \le B_{k+1, n-k}(U_1^{*}) \quad \text{and} \quad H_{n-k, s-1}(U_2^{*}) \le \lambda \le H_{n-k, s}(U_2^{*}),$$

where $B_{a,b}(u)$ is the quantile function of the Beta($a, b$) distribution evaluated at $u$ and $H_{m,s}(u)$ is the solution (in $\lambda$) of the equation $F_1(s \mid m, \lambda) = u$. Thus, a sample from the fiducial distribution is obtained by sampling $(U_1^{*}, U_2^{*})$ and using the above inequalities to solve for $\pi$ and $\lambda$. Consequently, when the parameter of interest is $\mu = (1 - \pi)\lambda$, the mean of the ZIP distribution, we have

$$\frac{H_{n-k, s-1}(U_2^{*})\left(1 - B_{k+1, n-k}(U_1^{*})\right)}{1 - e^{-H_{n-k, s-1}(U_2^{*})}} \le \mu \le \frac{H_{n-k, s}(U_2^{*})\left(1 - B_{k, n-k+1}(U_1^{*})\right)}{1 - e^{-H_{n-k, s}(U_2^{*})}},$$

if $k < n$. These bounds follow from writing $\mu = \lambda (1 - p_0)/(1 - e^{-\lambda})$ with $p_0 = \pi + (1 - \pi)e^{-\lambda}$, which is increasing in $\lambda$ and decreasing in $p_0$. When $k = n$, we have $0 \le \mu \le \infty$.

Finally, we need to select a representative region for the fiducial sample. Following Hannig et al. (2016), we choose a 50-50 mixture of the upper and lower bounds. In the case of $k = n$, this results in a 50-50 mixture of 0 and $\infty$. The algorithm for constructing a fiducial confidence interval for a ZIP mean is implemented as follows:

Algorithm 1

1. Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ be a sample of size $n$ from a ZIP distribution with Poisson parameter $\lambda$ and binomial parameter $\pi$. Calculate the number of zeros $k = \sum_{i=1}^{n} I_{\{0\}}(x_i)$ and the sum of the sample $s = \sum_{i=1}^{n} x_i$.
2. Generate realizations $U_1^{*}$ and $U_2^{*}$ independently from $U(0, 1)$. Then, calculate the realizations

$$M_L^{*} = \begin{cases} \dfrac{H_{n-k, s-1}(U_2^{*})\left(1 - B_{k+1, n-k}(U_1^{*})\right)}{1 - e^{-H_{n-k, s-1}(U_2^{*})}}, & \text{if } k < n \\ 0, & \text{if } k = n \end{cases}$$

$$M_U^{*} = \begin{cases} \dfrac{H_{n-k, s}(U_2^{*})\left(1 - B_{k, n-k+1}(U_1^{*})\right)}{1 - e^{-H_{n-k, s}(U_2^{*})}}, & \text{if } k < n \\ \infty, & \text{if } k = n, \end{cases}$$

where $B_{a,b}(u)$ is the quantile function of the Beta($a, b$) distribution evaluated at $u$ and $H_{m,s}(u)$ is the solution (in $\lambda$) of the equation $F_1(s \mid m, \lambda) = u$, where $F_1(s \mid m, \lambda)$ is the distribution function of the sum of $m$ zero-truncated Poisson($\lambda$) random variables.
3. Repeat step 2 $B$ times, yielding $2B$ fiducial samples of the ZIP mean $\mu$, denoted by $M_{L1}^{*}, \cdots, M_{LB}^{*}, M_{U1}^{*}, \cdots, M_{UB}^{*}$. The $1 - \alpha$ two-sided GCI of the ZIP mean is given by the lower and upper $\alpha/2$ quantiles of the fiducial samples.

3.2 Fiducial distribution of PH mean

The derivation of the fiducial distribution for the PH model follows that for the ZIP model mutatis mutandis. The PH distribution has probability mass function

$$p(x \mid \pi, \lambda) = \pi I_{\{0\}}(x) + (1 - \pi)\frac{\lambda^{x} e^{-\lambda}}{x!\,(1 - e^{-\lambda})} I_{\mathbb{N}^{+}}(x).$$

Note that after reparameterization, a ZIP distribution characterized by the binomial parameter $\mu_1$ and the Poisson parameter $\lambda_1$ can be expressed as a PH distribution that is characterized by the binomial parameter $\mu_2 = \mu_1 + (1 - \mu_1)e^{-\lambda_1}$ and the truncated Poisson parameter $\lambda_2 = \lambda_1$ (here $\mu_1$ and $\mu_2$ denote the binomial parameters of the two models, not means). Hence, the likelihoods of the two models are equivalent. The selection of the model should be based on how the zeros are generated. Note that this equivalency does not hold in the ZIP regression and PH regression settings. In those settings, the likelihoods are based on a conditional distribution (i.e., $Y$ given some covariates) for determining the estimates of the regression parameters. While the final likelihoods will typically be similar, they will not be equal.

The following proposition establishes the minimal sufficient statistic for a PH distribution:

Proposition 3.3 Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from a PH distribution. Denote the sum of the random sample as $S$ and the number of zeros of the random sample as $K$, where $S = \sum_{i=1}^{n} X_i$ and $K = \sum_{i=1}^{n} I_{\{0\}}(X_i)$. Then the minimal sufficient statistic is $(S, K)$.

Proof The proof is identical to the proof of Proposition 3.1, and is therefore omitted.
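Since the ZIP and PH models share the sufficient statistic $(S, K)$, Algorithm 1 needs only $(n, k, s)$ as input. The following is a rough sketch of it in plain Python, under several assumptions of ours: the Beta quantiles are obtained by solving the equivalent binomial equations $F_B(k \mid n, p) = u$ and $F_B(k - 1 \mid n, p) = u$ by bisection, so the uniform draw enters as $u$ rather than as $U_1^{*}$; $F_1$ is evaluated by summing the mass function of Proposition 3.2; and the helper names, the 60 bisection steps, and the percentile rule are illustrative choices. It is intended for small $n$ and $s$ only and is not the authors' implementation.

```python
import math
import random


def log_coefs(m, s):
    """log(m! S(i, m) / i!) for i = m..s, S = Stirling number of the second kind."""
    row, out = [1] + [0] * m, []
    for i in range(1, s + 1):
        row = [0] + [j * row[j] + row[j - 1] for j in range(1, m + 1)]
        if i >= m:
            out.append(math.log(row[m]) + math.lgamma(m + 1) - math.lgamma(i + 1))
    return out


def ztp_sum_cdf(t, m, lam, logc):
    """F_1(t | m, lam) for the sum of m zero-truncated Poisson(lam) variables."""
    if t < m:
        return 0.0
    log_norm = m * (lam + math.log(-math.expm1(-lam)))   # m * log(e^lam - 1)
    return sum(math.exp(logc[i - m] + i * math.log(lam) - log_norm)
               for i in range(m, t + 1))


def solve_decreasing(f, u, lo, hi):
    """Bisection for f(x) = u on [lo, hi], assuming f is decreasing in x."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > u else (lo, mid)
    return 0.5 * (lo + hi)


def lambda_bound(t, m, u, logc):
    """Solution in lam of F_1(t | m, lam) = u, taken as 0 when t < m."""
    if t < m:
        return 0.0
    hi = 1.0
    while ztp_sum_cdf(t, m, hi, logc) > u:    # expand until the root is bracketed
        hi *= 2.0
    return solve_decreasing(lambda lam: ztp_sum_cdf(t, m, lam, logc), u, 1e-12, hi)


def p0_bound(k, n, u, combs):
    """Solution in p of P(Binomial(n, p) <= k) = u, taken as 0 when k < 0."""
    if k < 0:
        return 0.0
    cdf = lambda p: sum(combs[i] * p**i * (1 - p)**(n - i) for i in range(k + 1))
    return solve_decreasing(cdf, u, 0.0, 1.0)


def mean_from(lam, p0):
    """(1 - p0) * lam / (1 - exp(-lam)), with the limit (1 - p0) as lam -> 0."""
    return (1.0 - p0) * (lam / -math.expm1(-lam) if lam > 0 else 1.0)


def fiducial_mean_gci(n, k, s, B=1000, alpha=0.05, seed=0):
    """Sketch of Algorithm 1: (1 - alpha) GCI for the ZIP/PH mean from (n, k, s)."""
    m = n - k                                 # number of positive observations
    if m == 0:
        return 0.0, math.inf                  # all zeros: 50-50 mix of 0 and infinity
    rng = random.Random(seed)
    logc = log_coefs(m, s)
    combs = [math.comb(n, i) for i in range(n + 1)]
    draws = []
    for _ in range(B):
        u1, u2 = rng.random(), rng.random()
        p_lo, p_hi = p0_bound(k - 1, n, u1, combs), p0_bound(k, n, u1, combs)
        lam_lo = lambda_bound(s - 1, m, u2, logc)
        lam_hi = lambda_bound(s, m, u2, logc)
        # Lower bound pairs the small lambda with the large zero probability.
        draws += [mean_from(lam_lo, p_hi), mean_from(lam_hi, p_lo)]
    draws.sort()
    lo_idx = int(alpha / 2 * (len(draws) - 1))
    hi_idx = math.ceil((1 - alpha / 2) * (len(draws) - 1))
    return draws[lo_idx], draws[hi_idx]


if __name__ == "__main__":
    print(fiducial_mean_gci(n=30, k=18, s=25, B=500))
```

Pooling all $2B$ lower and upper values before taking quantiles corresponds to the 50-50 mixture of the bounds. The cost is dominated by the repeated root-finding, which is consistent with the computational caveat raised in the concluding section.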
From the form of the PH probability mass function, we see that $K \sim \text{Binomial}(n, \pi)$, and $(S \mid K = k)$ has the same distribution as $\sum_{i=1}^{n-k} Y_i$, where the $Y_i$ are independent Poisson($\lambda$) random variables conditioned on the event $\{Y_i \ge 1\}$. We will then use the inverse distribution functions as a data generating equation

$$K = F_B^{-1}(U_1 \mid n, \pi) \quad \text{and} \quad S = F_1^{-1}(U_2 \mid n - K, \lambda),$$

where $U_1, U_2$ are independent $U(0, 1)$. When $K = n$, the value of $S$ is again set as 0.

After observing $k$ and $s$, and inverting the data generating equation, we see that

$$B_{k, n-k+1}(U_1^{*}) \le \pi \le B_{k+1, n-k}(U_1^{*}) \quad \text{and} \quad H_{n-k, s-1}(U_2^{*}) \le \lambda \le H_{n-k, s}(U_2^{*}),$$

where $B_{a,b}(u)$ and $H_{m,s}(u)$ are as defined for the ZIP setting. Thus, a sample from the fiducial distribution is obtained by sampling $(U_1^{*}, U_2^{*})$ and using the above inequalities to solve for $\pi$ and $\lambda$. Consequently, when the parameter of interest is $\mu = \frac{(1 - \pi)\lambda}{1 - e^{-\lambda}}$, the mean of the PH distribution, we have

$$\frac{H_{n-k, s-1}(U_2^{*})\left(1 - B_{k+1, n-k}(U_1^{*})\right)}{1 - e^{-H_{n-k, s-1}(U_2^{*})}} \le \mu \le \frac{H_{n-k, s}(U_2^{*})\left(1 - B_{k, n-k+1}(U_1^{*})\right)}{1 - e^{-H_{n-k, s}(U_2^{*})}},$$

if $k < n$. When $k = n$, we again have $0 \le \mu \le \infty$. Thus, the fiducial distributions of the ZIP mean and the PH mean are the same.

Finally, just as in the ZIP setting, the representative region for the fiducial sample is chosen as a 50-50 mixture of the upper and lower bounds. In the case of $k = n$, this again results in a 50-50 mixture of 0 and $\infty$. The algorithm for constructing a GCI for a PH mean is the same as in the ZIP setting, so it is omitted here.

4 Simulation study

We next assess the performance of the GCI just presented through an extensive simulation study. We also compare our results to the bootstrap confidence intervals constructed using the approach in Waguespack et al. (2020). Note that we do not include a comparison with the likelihood-based (i.e., Wald-based) confidence intervals since Waguespack et al.
(2020) already demonstrated the relatively superior performance of the bootstrap confidence intervals. The sample sizes used to assess the finite-sample performance of the GCI are $n \in \{15, 30, 100\}$. For the parameters, the mixture proportion $\pi$ is selected from $\{0.2, 0.5, 0.8\}$ and the mean $\lambda$ of the Poisson distribution is selected from $\{1, 5\}$. The simulation settings for the PH distribution are the same as for the ZIP distribution: sample sizes $n \in \{15, 30, 100\}$, mixture proportions $\pi \in \{0.2, 0.5, 0.8\}$, and mean of the Poisson distribution $\lambda \in \{1, 5\}$. Note that when $\pi = 0$ in the ZIP setting or $\pi = e^{-\lambda}$ in the PH setting, the data are actually simulated from the Poisson distribution Poisson($\lambda$). Moreover, we demonstrate the performance of our approach when there is no under-/over-dispersion in the data. Specifically, the same values of $n$ and $\lambda$ are considered, but no mixing proportion is present (or equivalently $\pi = 0$). The number of Monte Carlo samples for our simulations is set to 10,000 and the number of fiducial samples used is 1000. We also drew 10,000 bootstrap samples to construct the bootstrap confidence intervals.

For each simulation scenario, we estimated the probability $Q(X) = P(R(X) < M \mid X)$, where $M$ is the mean of the distribution. If the generalized fiducial inference were exact, then $Q(X)$ should follow a standard uniform distribution, which can be examined through Q-Q plots. In the results that follow, coverage probabilities and the median widths of the GCIs are reported. We report the median widths because the fiducial distribution is a 50-50 mixture of 0 and $\infty$ when all observations are zero, which has an expected value of $\infty$.

Furthermore, we compare type I error rates and the power for both the SLRT proposed by Waguespack et al. (2020) and our fiducial test for testing the ZIP mean. The simulation is set up similarly as in Waguespack et al.
(2020) to enable a side-by-side comparison: sample sizes $n \in \{30, 40, 50\}$, mixture proportions $\pi \in \{0.1, 0.3, 0.5\}$, and mean of the Poisson distribution $\lambda \in \{1, 1.3, 1.6, 2\}$.

The first set of simulation results is for the ZIP distributions. The results are given in Table 1. As we can see, across the different simulation scenarios, the coverage probabilities for the bootstrap confidence intervals are typically liberal, especially for sample sizes less than 100. Meanwhile, the coverage probabilities for the GCIs are all noticeably closer to the nominal level, except for the sample size $n = 15$ and $\lambda = 1$. Under such a setting, the simulated samples are almost all zeros, thus compromising the inference. Even though the GCIs are conservative in this setting, they are closer to the nominal level than the more liberal bootstrap confidence intervals. The median widths are, of course, narrower for the bootstrap confidence intervals, but that is a result of the procedure being noticeably liberal relative to the nominal level. The Q-Q plots for the different sample sizes are given in Fig. 1. As the sample size $n$ increases, the agreement between the actual p-value and the nominal p-value improves.

The second set of simulation results is for the PH distributions. The results are given in Table 2. We again obtain results similar to those in the ZIP setting across the different simulation scenarios. The coverage probabilities are close to nominal for the GCIs, whereas the bootstrap confidence intervals are again noticeably liberal. In fact, the setting with sample size $n = 15$ and $\lambda = 1$ appears to do slightly better here in the PH setting than in the analogous ZIP setting. The Q-Q plots for different sample sizes are given in Fig. 2.
The same asymptotic behavior identified in the ZIP setting is also observed from the present simulation results; specifically, as the sample size $n$ increases, the agreement between the actual p-value and the nominal p-value improves.

The third set of simulation results we consider is for the Poisson distribution. The results are given in Table 3. As noted earlier, the Poisson is just a special case of the ZIP and PH distributions in which there are no excess zeros. Again, across the different simulation scenarios, the coverage probabilities are close to nominal. For $n = 15$ and $n = 30$, there is a clear improvement using the GCIs compared to the bootstrap confidence intervals, but for large $n$, the procedures are comparable. This illustrates that, regardless of whether zero-inflation is present in the data, our method is still appropriate for constructing a confidence interval for the mean. The Q-Q plots for different sample sizes are given in Fig. 3. Only moderate discrepancies are noticeable when the sample size is small ($n = 15$) or moderate ($n = 30$); however, the tail behavior appears to be very good. Since we want to construct a 95% confidence interval, these discrepancies are not a concern as long as the tails are accurate, which is confirmed by our results in Table 3. The asymptotic behavior is also observed from the simulation results: as the sample size $n$ increases, the agreement between the actual p-value and the nominal p-value improves.

Table 1 Estimated coverage probabilities and median widths for the GCIs and bootstrap confidence intervals for the means generated from different ZIP distributions used in our simulation study

                    Bootstrap                 Fiducial
  n     λ    π      Cov. Prob.  Med. Width    Cov. Prob.  Med. Width
  15    1    0.2    0.905       0.867         0.961       1.037
  15    1    0.5    0.902       0.800         0.978       0.973
  15    1    0.8    0.853       0.400         0.980       0.933
  15    5    0.2    0.926       2.733         0.959       2.839
  15    5    0.5    0.918       2.800         0.952       2.915
  15    5    0.8    0.880       2.067         0.974       2.455
  30    1    0.2    0.920       0.667         0.953       0.716
  30    1    0.5    0.912       0.567         0.960       0.649
  30    1    0.8    0.878       0.367         0.976       0.521
  30    5    0.2    0.934       1.967         0.948       2.002
  30    5    0.5    0.936       2.067         0.952       2.079
  30    5    0.8    0.918       1.533         0.957       1.636
  100   1    0.2    0.943       0.380         0.950       0.384
  100   1    0.5    0.941       0.330         0.950       0.343
  100   1    0.8    0.936       0.230         0.958       0.248
  100   5    0.2    0.950       1.100         0.952       1.101
  100   5    0.5    0.944       1.150         0.947       1.147
  100   5    0.8    0.943       0.870         0.952       0.876

Fig. 1 Q-Q plots for ZIP models with sample size of (a) n = 15, (b) n = 30, and (c) n = 100

Table 2 Estimated coverage probabilities and median widths for the GCIs and bootstrap confidence intervals for the means generated from different PH distributions used in our simulation study

                    Bootstrap                 Fiducial
  n     λ    π      Cov. Prob.  Med. Width    Cov. Prob.  Med. Width
  15    1    0.2    0.914       0.867         0.962       1.023
  15    1    0.5    0.918       0.867         0.965       1.035
  15    1    0.8    0.870       0.600         0.979       0.925
  15    5    0.2    0.922       2.733         0.956       2.828
  15    5    0.5    0.920       2.867         0.954       2.933
  15    5    0.8    0.853       2.067         0.968       2.443
  30    1    0.2    0.931       0.667         0.953       0.708
  30    1    0.5    0.930       0.667         0.954       0.715
  30    1    0.8    0.891       0.467         0.974       0.573
  30    5    0.2    0.941       1.967         0.956       2.003
  30    5    0.5    0.937       2.067         0.950       2.080
  30    5    0.8    0.911       1.533         0.954       1.647
  100   1    0.2    0.943       0.370         0.946       0.378
  100   1    0.5    0.944       0.380         0.951       0.384
  100   1    0.8    0.938       0.280         0.956       0.292
  100   5    0.2    0.949       1.100         0.950       1.099
  100   5    0.5    0.942       1.150         0.946       1.149
  100   5    0.8    0.936       0.870         0.946       0.881

Fig. 2 Q-Q plots for PH models with sample size of (a) n = 15, (b) n = 30, and (c) n = 100

Table 3 Estimated coverage probabilities and median widths for the GCIs and bootstrap confidence intervals for the means generated from different Poisson distributions used in our simulation study

               Bootstrap                 Fiducial
  n     λ      Cov. Prob.  Med. Width    Cov. Prob.  Med. Width
  15    1      0.897       0.933         0.960       1.055
  15    5      0.913       2.133         0.962       2.442
  30    1      0.926       0.700         0.952       0.728
  30    5      0.934       1.567         0.958       1.667
  100   1      0.945       0.390         0.951       0.392
  100   5      0.948       0.870         0.956       0.885

The last set of simulation results estimates the type I error rates and power of the SLRT and the fiducial test for the following hypotheses on the ZIP mean under $\alpha = 0.05$: $H_0: \mu \le \mu_0$ versus $H_a: \mu > \mu_0$. The null value $\mu_0$ is taken to be $1 - \pi$, and the true ZIP mean is $\mu = (1 - \pi)\lambda$. Therefore, the type I error rate can be estimated when $\lambda = 1$. The type I error rates and power of the SLRT and fiducial test are reported in Table 4. Generally, the performance of the two tests is almost identical. The type I error rates are all close to the nominal level for both tests, while the power increases as $\lambda$ or the sample size gets larger.

Fig. 3 Q-Q plots for Poisson models with sample size of (a) n = 15, (b) n = 30, and (c) n = 100
Table 4 Type I error rates and power of the SLRT and fiducial test for testing the ZIP mean under α = 0.05: H_0: μ ≤ μ_0 versus H_a: μ > μ_0; μ = (1 − π)λ; μ_0 = (1 − π)

  λ     n     π      SLRT     Fiducial test
  1     30    0.1    0.048    0.046
  1     30    0.3    0.048    0.050
  1     30    0.5    0.048    0.046
  1     40    0.1    0.049    0.053
  1     40    0.3    0.048    0.048
  1     40    0.5    0.046    0.050
  1     50    0.1    0.045    0.050
  1     50    0.3    0.046    0.053
  1     50    0.5    0.045    0.045
  1.3   30    0.1    0.364    0.386
  1.3   30    0.3    0.295    0.296
  1.3   30    0.5    0.217    0.223
  1.3   40    0.1    0.479    0.471
  1.3   40    0.3    0.355    0.356
  1.3   40    0.5    0.259    0.270
  1.3   50    0.1    0.517    0.552
  1.3   50    0.3    0.408    0.422
  1.3   50    0.5    0.291    0.298
  1.6   30    0.1    0.799    0.810
  1.6   30    0.3    0.639    0.636
  1.6   30    0.5    0.461    0.470
  1.6   40    0.1    0.890    0.894
  1.6   40    0.3    0.745    0.741
  1.6   40    0.5    0.560    0.562
  1.6   50    0.1    0.943    0.947
  1.6   50    0.3    0.820    0.821
  1.6   50    0.5    0.637    0.640
  2     30    0.1    0.984    0.982
  2     30    0.3    0.906    0.905
  2     30    0.5    0.732    0.737
  2     40    0.1    0.997    0.996
  2     40    0.3    0.960    0.959
  2     40    0.5    0.838    0.833
  2     50    0.1    0.999    0.999
  2     50    0.3    0.983    0.983
  2     50    0.5    0.899    0.904

5 Application: urinary tract infection data

We construct GCIs for a ZIP distribution fit to data on urinary tract infections (UTIs). Note that our fiducial approach generates the same GCI regardless of whether the data follow a ZIP or a PH distribution. Therefore, the UTI data could also serve as an example for constructing a GCI for a PH distribution. These data came from n = 98 HIV-infected men who were treated by the Department of Internal Medicine at the Utrecht University Hospital in the Netherlands. The number of times each patient had a UTI was recorded as X. The frequency table is given in Table 5. The data were analyzed in van den Broek (1995), who used a score test to detect whether zero-inflation exists. Later, Bhattacharya et al. (2008) and Bayarri et al. (2008) applied Bayesian testing to test for zero-inflation. All of these analyses favor a ZIP distribution. Moreover, the use of a zero-inflated distribution is appropriate because the zeros are likely arising from two subgroups of men: one group that is otherwise healthy aside from having HIV (structural zeros), and one group that has a history of other issues with their urinary system (e.g., kidney stones) and, thus, could be at higher risk of eventually developing a UTI (random zeros).

Table 5 The frequency table of the number of UTIs recorded in the patients admitted at the Department of Internal Medicine at the Utrecht University Hospital

  X            0     1     2     3     Total
  Frequency    81    9     7     1     98

The number of fiducial samples is set to 10,000. The 95% GCI for the average number of UTIs that the patients have is (0.157, 0.434). The 95% SLRT confidence interval is (0.157, 0.426), which is close to the GCI, while the bootstrap confidence interval with 10,000 bootstrap samples is (0.143, 0.398). Even though one can easily calculate the sample mean from these data (x̄ = 0.266) and infer that, for example, the average number is less than 1, the approach we have presented affords the additional insights that accompany confidence interval interpretations, such as the reliability of our estimate of the mean and how far the interval falls from a particular value of interest. We also know from the coverage study in the previous section that the median width of the 95% GCI will be noticeably wider than that of the respective 95% bootstrap confidence interval. Practically speaking, this could have implications for the hospital's treatment plans for these UTIs. If treatment plans are benchmarked against the 95% bootstrap confidence intervals, then smaller and larger values beyond the respective limits of that interval will be omitted from such plans, whereas those values will be reflected via the 95% GCI.
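As a small companion to this application, the sketch below computes the sufficient statistics $(n, k, s)$ required by Algorithm 1 directly from the frequencies in Table 5, together with two simple descriptive checks for excess zeros. The variable names are ours, and the checks (a variance-to-mean ratio and a comparison with the Poisson zero probability) are informal diagnostics rather than the score or Bayesian tests cited above.

```python
import math

# Frequency table of UTI counts from Table 5: count value -> number of patients.
uti_freq = {0: 81, 1: 9, 2: 7, 3: 1}

n = sum(uti_freq.values())                        # sample size
k = uti_freq.get(0, 0)                            # number of zeros, K
s = sum(x * f for x, f in uti_freq.items())       # sum of the sample, S

mean = s / n
var = sum(f * (x - mean) ** 2 for x, f in uti_freq.items()) / (n - 1)

dispersion_index = var / mean                     # equals 1 under equi-dispersion
observed_zero_fraction = k / n
poisson_zero_prob = math.exp(-mean)               # P(X = 0) for Poisson(mean)

if __name__ == "__main__":
    print("n, k, s:", n, k, s)
    print("dispersion index:", dispersion_index)
    print("observed vs Poisson zero fraction:",
          observed_zero_fraction, poisson_zero_prob)
```

A dispersion index above 1 and an observed zero fraction above the Poisson zero probability are both in line with the zero-inflation reported for these data.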
6 Conclusion

In this article, generalized fiducial inference on ZIP and PH distributions was studied for the first time and applied to a healthcare dataset. The practical contribution of this method is that one can now easily calculate and report a confidence interval along with an estimate of the mean when using either a ZIP or a PH model. The theoretical advantage of this method is that it achieves good small-sample properties, except when the zero proportion π is large and the Poisson parameter λ is small. Also, unlike Bayesian inference, it does not depend on the selection of priors; it relies only on the data generating equation. A simulation study demonstrated that, for the confidence interval of the mean of ZIP and PH distributions, generalized fiducial inference works very well across various scenarios. Since the Poisson distribution can be considered a special case of the ZIP or PH distribution, the simulation also shows that our method for ZIP and PH distributions can accommodate constructing the confidence interval of the mean of a Poisson distribution. Thus, if the goal is only to construct a confidence interval for the mean of the count data, our approach can be applied directly, since it is not necessary to test for zero-inflation or to decide whether the data are under- or over-dispersed. Furthermore, our fiducial approach performed as well as the SLRT when testing the ZIP mean.

We note that the proposed method has a computational limitation, since it involves finding the root of a sum of factorials. When the sample size is large or the Poisson mean parameter is large, the computational effort could become prohibitive. A uniformly valid approximation exists for Stirling numbers of the second kind (Temme 1993), which can alleviate some of the computational burden, but this can translate to worse results for coverage probabilities.
Such simulation results are not shown here. We also note that generalized fiducial inference can be very similar to Bayesian inference once a fiducial distribution is obtained. We highlighted earlier that Bhattacharya et al. (2008) used Bayesian inference to test for the presence of zero-inflation. Future research will focus on extending the use of generalized fiducial inference to selecting between the Poisson distribution and ZIP/PH models. Moreover, there are broader inference considerations when fitting ZIP/PH regression models, such as joint confidence intervals on the regression parameters and simultaneous confidence intervals over the values of the covariate space. The utility of such inference is underscored by the recent emphasis placed on marginalized ZI regression models, which focus on modeling the mean response across the two states (Long et al. 2014; Todem et al. 2016; Martin and Hall 2017). These are further extensions worth considering in the generalized fiducial framework.

Abbreviations
GCI: Generalized confidence interval; GPQ: Generalized pivotal quantity; PH: Poisson hurdle; SLRT: Signed likelihood ratio test; UTI: Urinary tract infection; ZIP: Zero-inflated Poisson

Acknowledgements
The authors acknowledge Professor Kalimuthu Krishnamoorthy from the Department of Mathematics at the University of Louisiana at Lafayette for early discussions on the topic treated in this paper. The authors are also thankful to two reviewers, whose comments helped improve the quality of this manuscript.

Authors' contributions
Zou contributed to the writing of the manuscript, development of methodology, and execution of simulation work. Hannig contributed to the writing of the manuscript, development of methodology, and development of the numerical routines. Young contributed to the organization and writing of the manuscript and designing the simulation studies. All authors read and approved the final manuscript.
Funding
This research was not funded by any institution.

Availability of data and materials
The UTI data are reported in the text. Simulation scripts for the results presented are available upon request from the authors.

Competing interests
The authors declare that they have no competing interests.

Author details
1 Genentech, South San Francisco, California, USA. 2 Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA. 3 Dr. Bing Zhang Department of Statistics, University of Kentucky, Lexington, Kentucky, USA.

Received: 16 September 2020 Accepted: 5 February 2021

References
Bayarri, M. J., Berger, J. O., Datta, G. S.: Objective Bayes Testing of Poisson Versus Inflated Poisson Models. In: Clarke, B., Ghosal, S. (eds.) Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh. IMS Collections, vol. 3, pp. 105–121. Institute of Mathematical Statistics, Beachwood (2008)
Bhattacharya, A., Clarke, B. S., Datta, G. S.: A Bayesian test for excess zeros in a zero-inflated power series distribution. In: Balakrishnan, N., Peña, E. A., Silvapulle, M. J. (eds.) Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, pp. 89–104. Institute of Mathematical Statistics, Beachwood (2008)
Cameron, A. C., Trivedi, P. K.: Regression-Based Tests for Overdispersion in the Poisson Model. J. Econ. 46(3), 347–364 (1990)
Cameron, A. C., Trivedi, P. K.: Regression Analysis of Count Data, 2nd edn. Cambridge, New York (2013)
Cao, G., Hsu, W. W., Todem, D.: A Score-Type Test for Heterogeneity in Zero-Inflated Models in a Stratified Population. Stat. Med. 33(12), 2103–2114 (2014)
Deng, D., Paul, S. R.: Score Tests for Zero-Inflation and Over-Dispersion in Generalized Linear Models. Stat. Sin. 15(1), 257–276 (2005)
E, L., Hannig, J., Iyer, H.: Fiducial Intervals for Variance Components in an Unbalanced Two-Component Normal Mixed Linear Model. J. Am. Stat. Assoc. 103(482), 854–865 (2008)
Fisher, R. A.: The Fiducial Argument in Statistical Inference. Ann. Eugenics. 6(4), 391–398 (1935)
Gu, K., Ng, H. K., Tang, M. L., Schucany, W. R.: Testing the Ratio of Two Poisson Rates. Biom. J. 50(2), 283–298 (2008)
Hall, D. B., Berenhaut, K. S.: Score Tests for Heterogeneity and Overdispersion in Zero-Inflated Poisson and Binomial Regression Models. Can. J. Stat. 30(3), 415–430 (2002)
Hannig, J.: On Generalized Fiducial Inference. Stat. Sin. 19(2), 491–544 (2009)
Hannig, J., Iyer, H., Patterson, P.: Fiducial generalized confidence intervals. J. Am. Stat. Assoc. 101(473), 254–269 (2006)
Hannig, J., Iyer, H., Lai, R. C. S., Lee, T. C. M.: Generalized fiducial inference: A review and new results. J. Am. Stat. Assoc. 111(515), 1346–1361 (2016)
Itô, K.: Poisson Point Processes and Their Applications to Markov Processes. Springer, New York (2015)
Jansakul, N., Hinde, J. P.: Score Tests for Zero-Inflated Poisson Models. Comput. Stat. Data Anal. 40(1), 75–96 (2002)
Jansakul, N., Hinde, J. P.: Score Tests for Extra-Zero Models in Zero-Inflated Negative Binomial Models. Commun. Stat. Simul. Comput. 38(1), 92–108 (2008)
Lambert, D.: Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics. 34(1), 1–14 (1992)
Long, D. L., Preisser, J. S., Herring, A. H., Golin, C. E.: A Marginalized Zero-Inflated Poisson Regression Model with Overall Exposure Effects. Stat. Med. 33(29), 5151–5165 (2014)
Martin, J., Hall, D. B.: Marginal Zero-Inflated Regression Models for Count Regression. J. Appl. Stat. 44(10), 1807–1826 (2017)
Mathew, T., Young, D. S.: Fiducial-Based Tolerance Intervals for Some Discrete Distributions. Comput. Stat. Data Anal. 61, 38–49 (2013)
Mullahy, J.: Specification and Testing of Some Modified Count Data Models. J. Econ. 33(3), 341–365 (1986)
Ridout, M., Hinde, J., Demétrio, C. G. B.: A Score Test for Testing a Zero-Inflated Poisson Regression Model Against Zero-Inflated Negative Binomial Alternatives. Biometrics. 57(1), 219–223 (2001)
Springael, L., Van Nieuwenhuyse, I.: On the Sum of Independent Zero-Truncated Poisson Random Variables. Research paper 2006-011. June, 1–15 (2006)
Temme, N. M.: Asymptotic Estimates of Stirling Numbers. Stud. Appl. Math. 89(3), 233–243 (1993)
Todem, D., Kim, K., Hsu, W. W.: Marginal Mean Models for Zero-Inflated Count Data. Biometrics. 72(13), 986–994 (2016)
Todem, D., Hsu, W. W., Fine, J. P.: A Quasi-Score Statistic for Homogeneity Testing Against Covariate-Varying Heterogeneity. Scand. J. Stat. 45(3), 465–481 (2018)
van den Broek, J.: A Score Test for Zero Inflation in a Poisson Distribution. Biometrics. 51(2), 738–743 (1995)
Waguespack, D., Krishnamoorthy, K., Lee, M.: Tests and Confidence Intervals for the Mean of a Zero-Inflated Poisson Distribution. Am. J. Math. Manag. Sci. 39(4), 383–390 (2020)
Weerahandi, S.: Generalized confidence intervals. In: Exact statistical methods for data analysis. Springer, New York (1995)

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Journal of Statistical Distributions and Applications Springer Journals


Publisher: Springer Journals
eISSN: 2195-5832
DOI: 10.1186/s40488-021-00117-0