Statistical Modelling of Annual Maxima in Hydrology

In this paper conditional modelling of annual maxima for predicting flood water is considered. The aim is to predict flood water of rivers, where no data about discharge but data about properties of the catchment of the rivers are available. A generalized linear mixed model is used to model the annual maxima depending on properties of the catchment and to take the correlation among measurements of one year into account. The fitted means and variances according to this model are plugged into the method of moment estimates of the parameters of the Gumbel distribution to obtain some extreme quantiles. These quantiles are commonly used to predict flood water of rivers. This approach is applied to data from Styria (Austria). The result is a satisfactory model for predicting flood water for rivers, where no data about the discharge are available. Zusammenfassung: In diesem Beitrag wird das bedingte Modellieren von jährlichen Maxima zur Vorhersage von Hochwasser betrachtet. Das Ziel ist es die Höhe von Hochwasser von Flüssen vorherzusagen, bei denen keine Daten bezüglich des Abflusses aber bezüglich der Eigenschaften der Einzugsgebiete der Flüsse existieren. Es werden generalisierte lineare Mischmodelle verwendet um einerseits die jährlichen Maxima in Abhängigkeit von den Eigenschaften der Einzugsgebiete zu modellieren und anderseits die Korrelation zwischen den jährlichen Maxima verschiedener Flüsse eines Jahres zu berücksichtigen. Die geschätzten Erwartungswerte und Varianzen unter diesem Modell werden in die Momentenschätzer der Parameter der Gumbel Verteilung eingesetzt, um Schätzer für extreme Quantile zu erhalten. Diese Quantile werden häufig verwendet um Hochwasser von Flüssen vorherzusagen. Diese Methode wurde an Flüssen in der Steiermark (Österreich) angewendet. Das Resultat ist ein zufriedenstellendes Modell zur Vorhersage von Hochwasser für Flüsse, bei denen keine Abflussdaten zur Verfügung stehen.


Introduction
Annual maxima of the discharge of a river are commonly used to predict flood water.In this paper annual maxima from several rivers are modelled to predict flood water for rivers, where no data about the discharge are available.
The analyzed data are annual maxima of discharge from rivers in Styria (Austria).In hydrology it is well known, that the discharge and hence the annual maxima are influenced by the properties of the catchment of a river.Such a catchment is defined as Austrian Journal of Statistics, Vol. 35 (2006), No. 1, 21-30 the area of the landscape, where all the rain falling on this area discharges into the river.These properties can be easily obtained from a Geographic Information System (GIS) and are available for any river in Styria.Thus, the idea was to model annual maxima depending on these properties to obtain an appropriate model for predicting flood water.Because the distribution of such annual maxima is certainly non-normal, we do not consider any linear regression models.However, modelling can be done within the context of quasi-likelihood estimation in the generalized linear model (GLM) framework.To take the correlation among observations of different rivers at one year into account, the model is augmented with a random effect, which leads to the broad class of generalized linear mixed models (GLMMs).First analysis of the annual maxima indicates, that it is reasonable to assume temporal independence over the years.Therefore, only spatial dependency of the data is considered.The fitted values are plugged into the method of moment estimator of the parameters in the extreme value distribution to obtain estimates of certain quantiles.These are often used to predict flood water.This approach provides satisfactory estimates of flood water and can be applied for rivers, where only data about the properties of the catchment are available.
The analysis of extreme values has a long history and goes back to the traditional approach introduced by Gumbel (1958).There is an enormous literature on this topic, especially applied on environmental data such as rainfall and river heights.Davison and Smith (1990) presented a method for modelling univariate extremes in dependency of explanatory variables.Their approach is based on the exceedance over a threshold, with the assumption, that the difference between the observations and the threshold follow a generalized Pareto distribution.An extension to multivariate extremes is given in Coles and Twan (1991) and this technique is later on applied to extremes of areal rainfall in Coles and Twan (1996).In this work the extremes were modelled by means of explanatory variables taking their spatial dependence into account.A critical point of these models is the choice of a suitable threshold.If this value is chosen too large, then only some of the data exceed this value and all of them can be considered as extremes.If the value is, however, small, then we have many observations above it but most of them are not extremes at all.To avoid this problem we develop a method based on the classical extreme value theory, which additionally takes the spatial dependence into account.

Classical Extreme Value Theory
By means of classical extreme value distributions we are often able to analyze the statistical behavior of y = max{x 1 , . . ., x n } , where x 1 , . . ., x n is a sequence of independent identically distributed (i.i.d.) random variables having some distribution function F .Such a sequence usually represents values measured on a regular time scale.In our application, these x i 's (i = 1, . . ., n = 365) represent daily measured maxima of discharge of a river, so that y is the annual maximum of the discharge.As y is the maximum of a block of values, it is often denoted as block maximum.In general F is unknown and therefore the distribution function G of y can not be calculated exactly.But for large n (as in our case), the block maximum y follows one of three extreme value distributions, known as the Gumbel, Fréchet and Weibull distribution (Coles, 2001).These three distributions can be combined into a single family of distributions, the Generalized Extreme Value (GEV) family, with distribution function defined on the set {y | 1 + ξ(y − λ)/ν > 0}, where the parameters satisfy −∞ < λ < ∞, ν > 0 and −∞ < ξ < ∞ and are usually referred to as the location, scale and shape parameter, respectively.In this parametrization the cases of ξ > 0 and ξ < 0 correspond to the Fréchet and Weibull distribution, respectively.The special case ξ = 0 is known as the Gumbel distribution with distribution function The theoretical moments according to this distribution are where γ 0.57722 is Euler's constant.The advantage of this parametrization is, that for estimating the parameters no prior knowledge about the distribution of y is necessary.The data themselves determine, which of these three distributions is appropriate.Thus, we first fit a model allowing for all three parameters and then we subsequently test for the necessity of the shape parameter (Hosking, 1984).Because there is strong evidence that the annual maxima in our application follow a Gumbel distribution, we restrict our attention only onto this special member of the GEV family in the remainder.
Consider now a random sample of n annual maxima y = (y 1 , . . ., y n ) from a Gumbel distribution (1).Then the log likelihood function of this sample is To find the maximum likelihood (ML) estimates, the derivatives of (3) with respect to the parameters λ and ν have to be solved.There is no analytical solution to the system of these two equations, but standard iterative optimization methods can be applied in a straightforward way.An alternative method based on the empirical mean and variance is the method of moments.The theoretical moments in (2) yield estimates λ = ȳ − νγ , and ν = 6s 2 y /π 2 , (4) where ȳ and s 2 y denotes empirical mean and variance of our sample y.A common question in hydrology is: What is the highest level the flood water will exceed once every T years?This is often denoted as T -year threshold or return level.In extreme value theory this return level is defined as the quantile y q with q = 1 − 1/T , i.e.G(y q |λ, ν) = 1 − 1/T .Thus, in order to predict extreme flood water of a river for several return periods T , we are very interested in some extreme quantiles of a Gumbel distribution.These quantiles are obtained by inverting (1) resulting in where G(y q |λ, ν) = 1 − 1/T and T is the return period of interest.For example, if y i are annual maxima and the given return period is T = 100 years, then y q is the level, which is expected to exceed once every 100 years.
If we are interested in predicting flood water for a specified return period, we first estimate the parameters and then plug them into the quantiles (5) (see e.g.Coles, 2001).Of course this is only possible, if concrete data on such annual maxima are available.Sometimes there is only information about catchment properties available.In such cases it would be very useful to have some knowledge about the relationship between annual maxima of the discharge of the river and properties of the corresponding catchment.

Modelling Annual Maxima
Now the focus is on modelling the mean of non-identically distributed annual maxima to later use these estimates in the return levels (5).
Consider block maxima y it of i = 1, . . ., n rivers observed at time t.This can be annual maxima of several years from different rivers.It is assumed, that the maxima y it of a subject are independent and Gumbel distributed, that is y it ind ∼ Gumbel(λ i , ν i ).Furthermore, we assume that the parameters λ i and ν i of two different subjects are independent.Additionally, some explanatory variables x it = (x it1 , . . ., x itp ) are observed.The idea is to estimate the mean and the variance of y it depending on such explanatory variables and plug this estimates into the method of moment estimates of the parameters of the Gumbel distribution.If the mean-variance relationship of y it is known, then it can be modelled by a quasi-likelihood approach (Wedderburn, 1974) within the GLM framework (McCullagh und Nelder, 1989).In such GLMs the mean of a response variable is modelled as a function of the linear predictor, i.e.

E(y
where β is the p-dimensional column vector of interest and g(•) is the so called link function, which links the linear predictor to the mean.The variance of y it is assumed to be proportional to a specified function of the mean and is thus given by var(y it ) = φV (µ it ) .holds, because the shift in the intercept, σ 2 z /2, is constant over time.Now, for a given mean-variance relationship the estimated mean μi and variance φV (μ i ) can be plugged into the method of moment estimates of the parameters of the Gumbel distribution and the return level based on this approach can be estimated.In particular cases, the calculation of the return values can be simplified.Consider a quadratic mean-variance relationship, as it is the case for Gamma distributed responses.Then the estimates (4) based on the fitted mean μi and variance φV (μ i ) are and the return levels (5) simplify to This approach allows to predict return levels based on the estimation results of GLMMs.

Application
The data analyzed in this study are annual maxima of discharge measured at 102 rivers in Styria, Austria.The lengths of the observation times are between 10 and 52 years, giving an unbalanced data set.Empirical analysis suggested that the annual maxima follow a Gumbel distribution.For each river, there are various properties of its catchment available.Amongst these properties there is the catchment's area, the mean altitude dtm, the average amount of rain in this area, drainage density gd and types of land use, like the proportions of forest and no-vegetation (noveg).The idea now is to model the annual maxima in terms of these properties, because this information can be easily obtained from a GIS for any river in Styria.Thus, the mean annual maximum is assumed to be a function of these properties.Because the hydrogeological conditions are also relevant, the rivers were grouped into five homogeneous regions.Assuming that the annual maxima of a river are uncorrelated over time and that the annual maxima of different rivers in the same year are also uncorrelated, the mean response is analyzed within the framework of ordinary GLMs.For this purpose an appropriate link function and a suitable mean-variance relationship has to be specified.A log-link model is applied because the mean discharge of a river has to be a positive quantity.To get a first glimpse of the true mean-variance relationship, we generate the scatter-plot of river specific empirical means against their variances in Figure 1.The point pattern reveals a quadratic mean-variance relationship, which is estimated by the function 0.215µ 2 .Because of that, a quasi-likelihood approach is utilized based on a log-link model for the mean and on a quadratic variance function for the responses.A variable selection procedure was then applied to select the best fitting set of explanatory variables.Finally we found the model log(µ) = region * log(area)+noveg+rain+gd+dtm+forest+dtm.forest , with dtm.forest = √ dtm • forest.This model results in an estimated dispersion parameter of φ = 0.267, which is slightly larger than the estimate from fitting a quadratic curve through the empirical means/variances in Figure 1.However, as the annual maxima of discharge usually result from heavy rain storm events, we should expect that observations from the same year are presumably positively correlated.This might be due to the fact that heavy rain storms are not restricted to a small area or to a single catchment of a river.Thus, this spatial correlation should be taken into account and considered in the model.We extend the model just found before and incorporate some additional random effects which are specific for each year.Therefore, the considered model includes an intercept for each region, region specific coefficients for all explanatory variables and an random intercept for each year.In Table 1 estimates of the region specific intercepts are listed.The estimated coefficients in the reference region 1 are given in Table 2 and their deviations for regions 2 to 5 are in Table 3.
The estimated dispersion parameter under this model is φ = 0.199, which is much closer the value from the model in Figure 1.Of course it is still smaller than the respective fixed effects result, as the random effect now also explains some variability of the responses.
Because the main interest is on estimating return levels and hence on quantiles of the Gumbel distribution, the fitted mean and variance of each river was plugged into (7) to obtain the estimated quantiles (8).This was done for the results of both considered Austrian Journal of Statistics, Vol. 35 (2006), No. 1, 21-30   models.Then the obtained quantiles are compared to the reference quantiles based on the ML estimates of the parameters of the Gumbel distribution.In Figure 2 the reference return levels are plotted against the return levels based on the fitted mean and variance of the GLMM.
To compare both models we consider respective mean residual sum of squares.The residuals are defined as differences between reference return levels and return levels based on the GLM and on the GLMM.Their values are 4524 for the GLM and 4252 for the GLMM.This confirms the impression when comparing estimates of the dispersion parameter, namely that an additional random effect leads to a better result.

Results and Further Investigations
We present a method to predict flood water for rivers, even when no data on the discharge of the river is available.It is based on modelling the annual maxima depending only on some properties of the catchment of the river.This enables to predict flood water for any river, as soon as some characteristics of its catchment are known.A GLMM is utilized to model the mean of these annual maxima, mainly because allowing for random intercepts enhances the goodness-of-fit and it also accounts for an appropriate correlation structure between measurements within the same year.
As this method applied to data on rivers from Styria provides satisfactory prediction of flood water, it has some limitations.First it can be only applied, if we assume that the extremes stem from a Gumbel distribution.Second, as with the classical univariate approach this method is wasteful of data, thus it should be only applied if sufficiently many data is available.Whether this is compensate by analyzing data of several subjects together is an open question.Finally, confidence interval for return levels are not directly available and method for calculating such confidence intervals have to be developed.
As the dispersion parameter plays a main role for predicting return levels, further work has to be done to compare different estimates of this parameter.Second, prediction intervals for the return levels are of interest, thus it should be investigated, how at least approximative prediction intervals can be obtained.And finally, if we are not able to assume, that the maxima follow a Gumbel distribution, how can this method be extended onto the entire GEV family.

Figure 1 :
Figure 1: Empirical means and variances of annual maxima from 102 rivers and a fitted quadratic (variance) model through these points.

Figure 2 :
Figure 2: Comparison of the predicted return levels for a return period of 100 years based on the ML estimates and the fitted means and dispersion parameter of the GLMM.

Table 1 :
Estimated intercept and standard error for each region.

Table 2 :
Coefficients and standard errors for the reference region 1.

Table 3 :
Coefficients and standard errors for the deviations of each region from region 1.