Estimating high levels exceedance probabilities by point process approach with applications to northern Moravia precipitation and discharges series

J. Hydrol. Hydromech., 57, 2009, 3, 162-171, DOI: 10.2478/v10098-009-0015-z

DANIELA JARUŠKOVÁ
Department of Mathematics, Faculty of Civil Engineering, Czech Technical University, Thákurova 7, CZ-166 29 Praha 6, Czech Republic; jarus@mat.fsv.cvut.cz

The paper by Jarušková and Hanek (2006) advocated the application of the peaks over threshold method (POT method) for estimating the probability that a precipitation or discharge series exceeds a chosen high level. If daily precipitation amounts or average discharges are obtained at several stations, one might be interested in estimating the probability that all variables of interest, e.g. precipitation amounts measured at several stations, exceed some chosen high levels at the same time. The paper explains how a method based on the point process approach may be used to obtain good estimates of such probabilities. Moreover, it presents some useful parametric models that were successfully applied by the author to precipitation and discharge series of northern Moravia.

KEY WORDS: Precipitation and Discharges Series, High Level Exceedance Probabilities, Modeling Tails of Multivariate Distribution, Peaks over Threshold Method, Modeling Dependence Structure, Inverse Arguments Tail Dependence Function.

Introduction

Estimating high annual return levels of precipitation and discharge series is one of the basic problems of statistical hydrology. The problem has its parallel in estimating the probability that some given level is exceeded. The first problem consists in finding, for a given exceedance probability, a level u that is exceeded with exactly that probability; the second one consists in estimating P(X > x) for some given real x. The variable X corresponds to a quantity of interest, e.g. a daily precipitation amount or a daily average discharge.

If daily measurements over several years are available, we may try to create a reasonable probabilistic model for the distribution of the studied variable. If we are interested in the probability of exceedance of some large value x, the peaks over threshold method (POT method) described in Jarušková and Hanek (2006) may be applied. The basic idea of the POT method is that the domain of possible values of the variable is split into two parts, i.e.
below and above a chosen threshold. The tail above the threshold is estimated by the tail of an extreme value distribution, i.e. by a generalized Pareto distribution. The POT method belongs to a field of mathematical statistics known as "statistics of extremes". An overview of stochastic methods suggested for studying different extremal problems in hydrology has been presented by Katz et al. (2002).

Extreme weather conditions are often characterized not only by very heavy rain at one site, but rather by heavy rain over a vast area, so that daily precipitation amounts at several meteorological sites across the area are large. Here we may be interested in estimating the probability

P(X_1 > x_1, \ldots, X_k > x_k) = S(x_1, \ldots, x_k),    (1)

where X_i represents the daily precipitation amount at the i-th station. Similarly, supposing that a river has k tributaries, we may be interested in estimating (1), where X_i represents the daily average discharge of the i-th tributary.

In the language of mathematical statistics we suppose that our observations are realizations of independent k-dimensional vectors {(X_{i1}, ..., X_{ik}), i = 1, ..., n} with a distribution function F(x_1, ..., x_k). The goal of the statistical inference is to estimate (1) for large values x_1, ..., x_k.

We would like to recall that for any dimension k there exists a relationship between the exceedance probability (survival function) S(x_1, ..., x_k) and the corresponding distribution function F(x_1, ..., x_k), given by a so-called union-intersection formula. For instance, for a two-dimensional vector (X_1, X_2) it holds:

S(x_1, x_2) = P(X_1 > x_1, X_2 > x_2) = 1 - F_1(x_1) - F_2(x_2) + F(x_1, x_2).    (2)

Similarly, for a three-dimensional vector (X_1, X_2, X_3) it holds:

S(x_1, x_2, x_3) = P(X_1 > x_1, X_2 > x_2, X_3 > x_3) = 1 - F_1(x_1) - F_2(x_2) - F_3(x_3) + F_{12}(x_1, x_2) + F_{13}(x_1, x_3) + F_{23}(x_2, x_3) - F(x_1, x_2, x_3),    (3)

where F_1, F_2, F_3, F_{12}, F_{13}, F_{23} are the distribution functions corresponding to the lower dimensions.

Point process approach method

One of the methods for estimating (1) may be based on the point process approach. Joe et al. (1992) and Coles and Tawn (1991, 1994), who used the theoretical results of De Haan and Resnick (1977), worked out a procedure for the application of this approach to real data. The method has also been explained in detail by Beirlant et al. (2004). Its advantage consists in the fact that for a good estimate of (1) we do not need to estimate the distribution function F(x_1, ..., x_k), respectively the survival function S(x_1, ..., x_k), in its whole domain; it is sufficient to find a good estimator for large values of the arguments only. The estimation is carried out in two steps.

Step I

In the first step the one-dimensional distribution functions F_1, ..., F_k are estimated. Usually we estimate the marginal distribution functions F_i, i = 1, ..., k, by the peaks over threshold method, i.e., we choose subjectively a threshold u_i and estimate the distribution function below u_i, i.e. for x <= u_i, by a non-parametric estimate, e.g. by the empirical distribution function (or by its continuous version), while above u_i, i.e. for x > u_i, we use a generalized Pareto distribution:

F^{P}(x) = \begin{cases} 1 - \left(1 + \xi\,\dfrac{x - u}{\sigma}\right)_{+}^{-1/\xi}, & \xi \neq 0, \\[4pt] 1 - e^{-(x - u)/\sigma}, & \xi = 0, \end{cases}    (4)

where the parameters σ_i > 0 and ξ_i are estimated by their maximum likelihood estimates. (We denote a_+ = max(a, 0).) A detailed description of the POT method may be found in Jarušková and Hanek (2006). Using the above procedure we get the estimates of the one-dimensional marginal distribution functions for all coordinates i = 1, ..., k.
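To make Step I concrete, here is a minimal sketch of the marginal POT fit of Eq. (4): the empirical distribution function below a 95% quantile threshold is glued to a generalized Pareto tail above it. This is not the authors' code; the function name fit_pot_margin, the use of scipy.stats.genpareto and the synthetic gamma sample are illustrative assumptions.

```python
import numpy as np
from scipy.stats import genpareto

def fit_pot_margin(x, q=0.95):
    """POT model (4): empirical CDF below the threshold u,
    generalized Pareto tail above it. Returns a callable estimate of F_i."""
    x = np.asarray(x, dtype=float)
    u = np.quantile(x, q)                      # threshold = 95% sample quantile
    excesses = x[x > u] - u
    # scipy's genpareto: shape c plays the role of xi, scale of sigma; location fixed at 0
    xi, _, sigma = genpareto.fit(excesses, floc=0.0)
    p_u = np.mean(x <= u)                      # empirical P(X <= u)

    def F_hat(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        below = np.array([np.mean(x <= s) for s in t])   # empirical part, x <= u
        above = p_u + (1.0 - p_u) * genpareto.cdf(t - u, xi, loc=0.0, scale=sigma)
        return np.where(t <= u, below, above)

    return F_hat, u, sigma, xi

# toy usage with a synthetic "daily discharge" sample
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=5.0, size=10000)
F1_hat, u1, sigma1, xi1 = fit_pot_margin(sample)
print(u1, sigma1, xi1, 1.0 - F1_hat(40.0))
```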
Step II

In the second step we estimate the "dependence structure". If we supposed for a while that all the one-dimensional marginal distributions were known, we could transform the variables into variables with any desired marginal distributions. One possibility would be to transform them into variables with standard normal marginals and to model the dependence structure by their correlation matrix. Clearly, in reality we do not know the true marginal distributions, but on the other hand, we can suppose that our estimates from Step I are so good that they differ from the true marginal distributions only negligibly. Fig. 1 shows a scatter plot of daily precipitation amounts at two chosen stations and Fig. 2 shows a scatter plot of the transformed values

\left(\Phi^{-1}\bigl(\hat F_1(x_{i1})\bigr),\ \Phi^{-1}\bigl(\hat F_2(x_{i2})\bigr)\right), \quad i = 1, \ldots, n,    (5)

where Φ^{-1} denotes the inverse standard normal distribution function and \hat F_1, \hat F_2 are empirical distribution functions that serve as estimates of the distribution functions F_1, F_2.

Fig. 1. Scatter plot of daily precipitation amounts measured at two stations.

Fig. 2. Scatter plot of daily precipitation amounts measured at two stations after transforming the data into standard normal variates using (5).

We can see that the scatter plot in Fig. 2 does not look like a scatter plot of realizations of a bivariate normal distribution with standard normal marginals. The transformed data exhibit stronger dependence in the upper tail than we would expect from realizations of a bivariate normal vector. Clearly, the idea of transforming the variables into normally distributed variables is not good.

Instead of transforming the variables into normally distributed variables, we suggest transforming them into variables distributed according to the standard Fréchet distribution with the distribution function G(x) = exp(-1/x) for x > 0 and the inverse distribution function G^{-1}(t) = -1/log(t) for 0 < t < 1. Theoretically, we use the transformation Z_1 = -1/log(F_1(X_1)), ..., Z_k = -1/log(F_k(X_k)). Practically, it means that we transform the data vectors (x_{i1}, ..., x_{ik}), i = 1, ..., n, into the vectors

(z_{i1}, \ldots, z_{ik}) = \left(-\frac{1}{\log \hat F_1(x_{i1})}, \ldots, -\frac{1}{\log \hat F_k(x_{ik})}\right).    (6)

The idea of approximating the upper tail of a multivariate vector (Z_1, ..., Z_k) by the tail of an extreme value distribution comes from De Haan and Resnick (1977). They proved that the subset of the transformed data with all coordinates exceeding high thresholds forms a point process that can be approximated by a Poisson process defined on R^k with a nonhomogeneous intensity measure Λ. The connection between the measure Λ and the distribution function of the transformed variables Z_1, ..., Z_k for (z_1, ..., z_k) large can be expressed with the help of a so-called inverse arguments tail dependence function A as follows:

P(Z_1 \le z_1, \ldots, Z_k \le z_k) \approx e^{-\Lambda\left(\left((0, z_1) \times \ldots \times (0, z_k)\right)^{c}\right)} = e^{-A(z_1, \ldots, z_k)},    (7)

where ((0, z_1) × ... × (0, z_k))^c is the complement of the multidimensional rectangle ((0, z_1) × ... × (0, z_k)). For more information see Beirlant et al. (2004).

The advantageous property of Λ is the following. If we transform the Cartesian coordinates (z_1, ..., z_k) into the coordinates (r, ω_1, ..., ω_k), which resemble spherical coordinates, by the transformation r = z_1 + ... + z_k, ω_1 = z_1/r, ..., ω_k = z_k/r, then the intensity measure factorizes:

\Lambda(dr, d\omega) = R(dr)\, H(d\omega),    (8)

with the measure R having the density function g(r) = 1/r^2 for r > 0.
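The following sketch illustrates the transformation (6) to the standard Fréchet scale and the pseudo-polar coordinates (r, ω) behind the factorization (8). The clipping of the estimated probabilities and the choice of the radial threshold as a sample quantile are my own assumptions for the illustration, not prescriptions from the paper.

```python
import numpy as np

def to_frechet(X, F_hats):
    """Transformation (6): z_ij = -1 / log(F_j_hat(x_ij)), one margin per column."""
    n, k = X.shape
    Z = np.empty_like(X, dtype=float)
    for j in range(k):
        p = np.clip(F_hats[j](X[:, j]), 1e-12, 1 - 1e-12)   # keep the logarithm finite
        Z[:, j] = -1.0 / np.log(p)
    return Z

def to_pseudo_polar(Z):
    """r = z_1 + ... + z_k, omega_j = z_j / r (points of the simplex S_k)."""
    r = Z.sum(axis=1)
    omega = Z / r[:, None]
    return r, omega

def angular_sample(Z, quantile=0.95):
    """Keep only the observations far from the origin, r > r_0 (the set R_{r_0})."""
    r, omega = to_pseudo_polar(Z)
    r0 = np.quantile(r, quantile)              # subjective radial threshold
    keep = r > r0
    return omega[keep], r0

# toy usage with (already) uniform pseudo-margins, so each F_hat is the identity
rng = np.random.default_rng(1)
U = rng.uniform(size=(5000, 2))
omega, r0 = angular_sample(to_frechet(U, [lambda v: v, lambda v: v]))
print(omega.shape, r0)
```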
The measure H, usually called the spectral measure, is defined on the set S_k = {(ω_1, ..., ω_k); ω_i >= 0, i = 1, ..., k, ω_1 + ... + ω_k = 1}, e.g. for k = 2 the measure H is defined on the segment with the endpoints (0, 1) and (1, 0); for k = 3 the measure H is defined on the triangle with the vertices (1, 0, 0), (0, 1, 0), (0, 0, 1). The factorization (8) means that for any Θ ⊂ S_k

\frac{\Lambda\left(\{(r, \omega);\ r > r_1,\ \omega \in \Theta\}\right)}{\Lambda\left(\{(r, \omega);\ r > r_1\}\right)} = \frac{\Lambda\left(\{(r, \omega);\ r > r_2,\ \omega \in \Theta\}\right)}{\Lambda\left(\{(r, \omega);\ r > r_2\}\right)}.    (9)

The goal of the statistical inference in Step II is to estimate the spectral measure H. We know that a Poisson process with the intensity measure Λ is a good approximation for the subset of the data that are far away from the origin. In practice it means that we choose subjectively some threshold r_0, transform the values {(z_{i1}, ..., z_{ik}), i = 1, ..., n} into {(r_i, ω_{i1}, ..., ω_{ik}), i = 1, ..., n} by the above introduced transformation, and deal with the subset of data R_{r_0} = {(r_i, ω_{i1}, ..., ω_{ik}); r_i > r_0} only.

The factorization (8) of Λ enables us to find an estimate of the spectral measure H, or of its density h if it exists. Instead of dealing with h(ω_1, ..., ω_k) on the set S_k we often estimate h_s(ω_1, ..., ω_{k-1}) = h(ω_1, ..., ω_{k-1}, 1 - ω_1 - ... - ω_{k-1}), i.e. for k = 2 we estimate h_s(ω) = h(ω, 1 - ω), for k = 3 we estimate h_s(ω_1, ω_2) = h(ω_1, ω_2, 1 - ω_1 - ω_2), etc. Most frequently we model the function h_s by a known mathematical function with unknown parameters (θ_1, ..., θ_p) and estimate these parameters by their maximum likelihood estimators. More precisely, we search for the values of the parameters that maximize

\sum_{\{i;\ (r_i,\ \omega_{i1}, \ldots, \omega_{ik}) \in R_{r_0}\}} \log h_s\!\left(\omega_{i1}, \ldots, \omega_{i,k-1};\ \theta_1, \ldots, \theta_p\right).    (10)

After having estimated the spectral density we may proceed by estimating the inverse arguments tail dependence function A, replacing the true density function h by its estimate \hat h in the expression

A(z_1, \ldots, z_k) = \int_{S_k} \max\!\left(\frac{\omega_1}{z_1}, \ldots, \frac{\omega_k}{z_k}\right) h(\omega_1, \ldots, \omega_k)\, d\omega_1 \ldots d\omega_k.    (11)

For some models there exists an explicit formula for the integral (11); for some others the integral has to be calculated numerically.

Replacing the tail dependence function A in (7) by its estimate \hat A, the multivariate distribution function may be approximated for large values of the arguments by

F(x_1, \ldots, x_k) \approx \exp\!\left(-\hat A\!\left(-\frac{1}{\log \hat F_1(x_1)}, \ldots, -\frac{1}{\log \hat F_k(x_k)}\right)\right).    (12)

Finally, the exceedance probability may be calculated using the union-intersection formula.

Models

We present here several models that belong to the family of so-called logistic distributions. It is convenient that their inverse arguments tail dependence function has an explicit expression, so that no numerical integration of (11) is needed.

Bivariate logistic distribution

The spectral density function h has the form:

h(\omega_1, \omega_2) = (\alpha - 1)\,(\omega_1 \omega_2)^{-1-\alpha} \left(\omega_1^{-\alpha} + \omega_2^{-\alpha}\right)^{1/\alpha - 2},    (13)

while for the function A it holds:

A(z_1, z_2) = \left(\left(\frac{1}{z_1}\right)^{\alpha} + \left(\frac{1}{z_2}\right)^{\alpha}\right)^{1/\alpha}.    (14)

The model has one parameter α > 1 that expresses the dependence between the variables. The larger the value of α, the stronger the dependence. Sometimes the reciprocal parameter 1/α is used instead of α.
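A minimal sketch of this part of Step II for the bivariate logistic model: the parameter α is estimated by maximizing (10) with the spectral density (13) evaluated at the angular components of the points with r > r_0. The bounded search interval and the beta-distributed placeholder angles in the toy call are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_loglik_bilogistic(alpha, omega1):
    """Negative value of (10) for the bivariate logistic spectral density (13):
    h_s(w) = (alpha-1) * (w*(1-w))**(-1-alpha) * (w**(-alpha) + (1-w)**(-alpha))**(1/alpha - 2)."""
    w = omega1
    log_h = (np.log(alpha - 1.0)
             + (-1.0 - alpha) * (np.log(w) + np.log(1.0 - w))
             + (1.0 / alpha - 2.0) * np.log(w ** (-alpha) + (1.0 - w) ** (-alpha)))
    return -np.sum(log_h)

def fit_bilogistic(omega1):
    """Maximize (10) over alpha > 1 for the angular components of the exceedances."""
    res = minimize_scalar(neg_loglik_bilogistic, bounds=(1.0 + 1e-6, 50.0),
                          args=(omega1,), method="bounded")
    return res.x

# toy usage: 'omega1' stands for the first angular components of the points with r > r_0
rng = np.random.default_rng(2)
omega1 = np.clip(rng.beta(2.0, 2.0, size=400), 1e-6, 1 - 1e-6)   # placeholder angles
print(fit_bilogistic(omega1))
```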
Multivariate symmetric logistic distribution

The model is a generalization of the preceding bivariate logistic model for k >= 2. It has one parameter r > 1 that expresses the overall dependence. The larger the value of r, the stronger the dependence. The assumption that the dependence between any couple of variables X_i, X_j, i ≠ j, i, j = 1, ..., k, is the same seems to be too restrictive, but the model very often gives reasonable results and is easy to deal with. The spectral density has the following form:

h(\omega_1, \ldots, \omega_k) = \left(\prod_{j=1}^{k-1} (jr - 1)\right) \left(\prod_{j=1}^{k} \omega_j\right)^{-(r+1)} \left(\sum_{j=1}^{k} \omega_j^{-r}\right)^{1/r - k}.    (15)

The function A may be expressed as follows:

A(z_1, \ldots, z_k) = \left(z_1^{-r} + \ldots + z_k^{-r}\right)^{1/r}.    (16)
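Because (16) is explicit, the approximation (12) of the joint distribution function requires no numerical integration. The sketch below evaluates it for k = 3; the identity "marginal estimates" and the value r = 1.81 (the symmetric-logistic estimate reported later in Example 2) are used purely as placeholders.

```python
import numpy as np

def A_sym_logistic(z, r):
    """Eq. (16): A(z_1,...,z_k) = (z_1**(-r) + ... + z_k**(-r))**(1/r), r > 1."""
    z = np.asarray(z, dtype=float)
    return np.sum(z ** (-r)) ** (1.0 / r)

def joint_cdf_approx(x, F_hats, A, **A_params):
    """Eq. (12): F(x_1,...,x_k) ~ exp(-A(-1/log F_1_hat(x_1), ..., -1/log F_k_hat(x_k)))."""
    z = np.array([-1.0 / np.log(F(xj)) for F, xj in zip(F_hats, x)])
    return np.exp(-A(z, **A_params))

# toy usage with hypothetical uniform margins (F_hat = identity) and r = 1.81
F_hats = [lambda v: v] * 3
print(joint_cdf_approx([0.99, 0.98, 0.995], F_hats, A_sym_logistic, r=1.81))
```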
Trivariate asymmetric logistic distribution

To capture the dependence between every pair of variables is not an easy task. It is possible in the three-dimensional case with the following model, which we applied to estimate the exceedance probabilities for the precipitation series measured at three meteorological stations. Its spectral density h(ω_1, ω_2, ω_3) (Eqs. (17)-(18)) and the corresponding function A(z_1, z_2, z_3) (Eq. (19)) have rather complicated explicit forms. The model has three parameters α > 1, α_1 > 1, α_2 > 1. The parameter α expresses the baseline dependence between the variables Z_1 and Z_3, while the parameters α_1 and α_2 add some dependence to the respective pairs Z_1, Z_2 and Z_3, Z_2.

Applications

The studied data are daily measurements, i.e. daily precipitation amounts or daily average discharges. We are interested in the probability that on the same day the measurements at all stations exceed certain given levels. Of course, we are especially interested in high levels that lie on the border of the domain where values were observed, or even beyond it, i.e. in levels for which it is unreasonable to use relative frequencies as estimators.

There are two aspects that should be considered when studying daily measurements. The first one is the dependence between neighbouring observations and the second one is seasonality. It was shown by Jarušková and Hanek (2006) that if these aspects are not taken into account, the exceedance probabilities are usually slightly overestimated. The problem of seasonality may be solved by splitting the series into more homogeneous parts corresponding to different seasons. The problem of dependence is more difficult to solve. If we are interested in the probability that daily measurements exceed some given levels, we can use a declustering technique to get a good estimate. However, the probability that during a year the measurements at all stations will exceed the given levels on the same day may be affected by this dependence. Despite suggestions by different authors, a simple way to incorporate the dependence into the model does not exist.

Example 1

The data describe daily average discharges [m^3 s^-1] of the Opava and Opavice rivers measured at Krnov in the period 1. 11. 1963 - 31. 10. 2003, i.e. both series consist of n = 16 071 observations. We denote by X_1 the daily average discharge of the Opava and by X_2 the daily average discharge of the Opavice. Suppose that we are interested in P(X_1 > x_1, X_2 > x_2) for (x_1, x_2) = (40, 20), (45, 25), (55, 30), (100, 50).

We proceed in two steps. In the first step we estimate the marginal distributions of X_1 and X_2 by the POT method. The thresholds are chosen to be equal to the 95% quantiles of the observations.

Table 1. The chosen thresholds and the estimates of the parameters of the generalized Pareto distributions for estimating the marginal distribution functions of daily discharges of the Opava and Opavice.

In the second step we transform the observed values and maximize (10) with the spectral density function h given by (13). The maximum likelihood estimate of the parameter α of the bivariate logistic distribution is 2.627. The histogram of the angular components {ω_{1i}, i = 1, ..., n}, together with the fitted spectral density function h_s(ω_1) = h(ω_1, 1 - ω_1) given by (13), is shown in Fig. 3. We see that the fit is not bad.

Fig. 3. Histogram of the angular components corresponding to daily average discharges of the Opava and Opavice and the estimated spectral density of the bivariate logistic model.

Tab. 2 presents the exceedance probabilities estimated using the bivariate logistic model (column 2). Column 3 shows the estimates of the same probabilities by simple relative frequencies. The estimates based on the stochastic model agree well with the relative frequencies; however, for larger values of the arguments they slightly overestimate the probabilities of interest.

Table 2. The exceedance probabilities P(X_1 > x_1, X_2 > x_2) estimated by the suggested method and by the relative frequencies.

(x_1, x_2)    Estimated probability    Relative frequency
(40, 20)      0.00153                  0.00161
(45, 25)      0.00103                  0.00118
(55, 30)      0.00065                  0.00062
(100, 50)     0.00017                  0.00012
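To show how the pieces of Example 1 fit together, the sketch below combines the bivariate logistic tail dependence function (14) and the approximation (12) with the union-intersection formula (2). Only the dependence parameter α = 2.627 is taken from the text; the marginal estimates F1_hat and F2_hat are hypothetical stand-ins for the POT fits of Tab. 1, with made-up thresholds and scales.

```python
import numpy as np

def A_bilogistic(z1, z2, alpha):
    """Eq. (14): A(z_1, z_2) = ((1/z_1)**alpha + (1/z_2)**alpha)**(1/alpha)."""
    return ((1.0 / z1) ** alpha + (1.0 / z2) ** alpha) ** (1.0 / alpha)

def exceedance_prob(x1, x2, F1_hat, F2_hat, alpha):
    """P(X1 > x1, X2 > x2) via (2), with F(x1, x2) approximated as in (12)."""
    F1, F2 = F1_hat(x1), F2_hat(x2)
    z1, z2 = -1.0 / np.log(F1), -1.0 / np.log(F2)
    F12 = np.exp(-A_bilogistic(z1, z2, alpha))          # eq. (12) for k = 2
    return 1.0 - F1 - F2 + F12                          # eq. (2)

# hypothetical marginal estimates (placeholders for the POT fits of Tab. 1)
F1_hat = lambda x: 1.0 - 0.05 * np.exp(-(x - 30.0) / 12.0)   # toy tail above u1 = 30
F2_hat = lambda x: 1.0 - 0.05 * np.exp(-(x - 15.0) / 6.0)    # toy tail above u2 = 15

for levels in [(40, 20), (45, 25), (55, 30), (100, 50)]:
    print(levels, exceedance_prob(*levels, F1_hat, F2_hat, alpha=2.627))
```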
Example 2

To assess the probability of extreme wet weather conditions we have chosen three stations in northern Moravia with different precipitation characteristics, located not extremely close to each other: Heřmanovice (HE), Albrechtice - Žáry (ZY) and Lichnov (LI). The data set consists of n = 15 131 daily precipitation amounts [mm] measured at each of these stations in the period 1/1/1960 - 6/2/2005 (some data are missing).

In the first step we estimate the marginal distribution functions using the POT method with the thresholds equal to the 95% quantiles of all observations. Tab. 3 presents the threshold values and the estimated parameters of the generalized Pareto distribution.

Table 3. The chosen thresholds and the estimates of the parameters of the generalized Pareto distribution for estimating the marginal distribution functions of daily precipitation amounts at Heřmanovice, Albrechtice - Žáry and Lichnov.

In the second step we transform the data and model the dependence structure by the trivariate asymmetric logistic distribution. Fig. 4 presents a scatter plot of the first two angular components calculated from the studied data. The estimates of the parameters of the asymmetric logistic distribution obtained by the maximum likelihood method, i.e. by maximizing (10) with h defined by (17), are α̂ = 1.773, α̂_1 = 1.235, α̂_2 = 1.221. For comparison we also model the dependence structure by the trivariate symmetric logistic distribution with the spectral density (15); the maximum likelihood estimate is r̂ = 1.81.

Fig. 4. Scatter plot of the first two angular components corresponding to the daily precipitation amounts measured at the stations HE, ZY, LI.

Tab. 4 shows the estimated exceedance probabilities for several triples of levels. The real exceedance frequency was equal to 0 for all considered triples.

Table 4. The estimated exceedance probabilities P(X_1 > x_1, X_2 > x_2, X_3 > x_3) when the dependence structure was modeled by the asymmetric logistic distribution and by the multivariate symmetric logistic distribution.

(x_1, x_2, x_3)    Asymmetric logistic    Symmetric logistic
-                  56.9 x 10^-6           30.0 x 10^-6
-                  162.6 x 10^-6          183.1 x 10^-6
-                  3.9 x 10^-6            2.3 x 10^-6
-                  117.3 x 10^-6          102.6 x 10^-6

In higher dimensions it may be necessary to assume that the dependence between all pairs of variables is the same and to use the multivariate symmetric logistic distribution. Hydrologists like to express the probability in years rather than in days. The quality of the estimates is affected by seasonality and by the dependence between observations on subsequent days. The effect of these two factors in the one-dimensional case was discussed by Jarušková and Hanek (2006); in the multivariate case the situation is similar. According to our experience, despite the mentioned problems the method yields reasonable results.

Acknowledgement. The presented study was partly carried out within the framework of the projects MSM6840770002 and GACR 201/09/0775.
