Stochastic modelling of the spatial spread of influenza in Germany

: In geographical epidemiology, disease counts are typically available in discrete spatial units and at discrete time-points. For example, surveillance data on infectious diseases usually consists of weekly counts of new infections in pre-deﬁned geographical areas. Similarly, but on a different time-scale, cancer registries typically report yearly incidence or mortality counts in administrative regions. A major methodological challenge lies in building realistic models for space-time interactions on discrete irregular spatial graphs. In this paper we will discuss an observation-driven approach, where past observed counts in neigh-boring areas enter directly as explanatory variables, in contrast to the parameter-driven approach through latent Gaussian Markov random ﬁelds (Rue and Held, 2005) with spatio-temporal structure. The main focus will lie on the demonstration of the spread of inﬂuenza in Germany, obtained through the design and simulation of a spatial extension of the classical SIR model (Huf-nagel et al., 2004). In diesem Artikel besch¨afti-gen wir uns mit einem Ansatz, der bereits erfasste F¨alle in angrenzenden Gebieten direkt als erkl¨arende Variablen einbezieht, im Gegensatz zur Modellierung durch latente Gauß-Markov-Zufallsfelder (Rue and Held, 2005) mit r¨aumlich-zeitlicher Struktur. Dazu stellen wir die Ausbreitung von Inﬂuenza in Deutschland mittels einer r¨aumlichen Erweiterung des klassichen SIR-Modells (Hufnagel et al., 2004) in Computersimulationen nach.


Introduction
There has been much recent interest in space-time models for disease counts collected in discrete spatial units and discrete time points. While most of the work has mainly focused on non-infectious diseases, in particular on cancer, recently models for infectious 6 Austrian Journal of Statistics, Vol. 35 (2006), No. 1, 5-20 disease data have been developed. For non-infectious diseases, hierarchical Bayesian approaches have been proposed, where latent parameters follow Gaussian Markov random field (GMRF) models (Waller et al., 1997, Knorr-Held and Besag, 1998, Knorr-Held, 2000a, Lagazio et al., 2001, Lagazio et al., 2003, Schmid and Held, 2004. Common to these models is the assumption that the observed counts are conditionally independent, given the latent parameters. However, the allowance for realistic space-time interaction in GMRFs is non-trivial, one approach that dates back to Clayton (1996) is to use Kronecker product structures (see also Rue and Held, 2005, Section 3.4.3) for interaction parameters while keeping main effects for overall spatial and temporal trends.
In this paper we will focus on a different modelling strategy, where past counts enter explicitly in the disease rate and hence the conditional independence assumption is lost. This class of models, called observation-driven (Cox, 1981), is motivated by the fact that parameter-driven models, such as the GMRF models mentioned above, are not able to capture the epidemic trends observable in data on infectious diseases. Indeed, epidemic models have used such observation-driven models for decades; in particular the class of SIR models (susceptible-infected-removed) has been extensively studied. However, this has been done mainly in a purely temporal and simplistic context, ignoring the fact that global epidemics spread in a spatio-temporal fashion.
A recent approach described in Hufnagel et al. (2004) fills this gap, proposing a spatio-temporal model on two scales (local and global) to describe the spread of the SARS epidemic based on stochastic differential equation models. Watts et al. (2005) developed a metapopulation model which incorporates mixing even at multiple scales. We adopt and extend the model of Hufnagel et al. (2004) and use it to investigate if it is able to describe an influenza epidemic in Germany 2005.
A major requirement in SIR models is knowledge of the number of susceptibles. In surveillance applications, often the whole population is considered as susceptible, due to the lack of available data (e.g. Knorr-Held and Richardson, 2003). An alternative approach is to use a branching process model as an approximation to the SIR model. This class of models has the advantage that it does not require knowledge of the number of susceptibles, however, some form of stationarity is needed to ensure that the stochastic process, describing the number of counts at each time point, does not explode to infinity. Held, Höhle, and Hofmann (2005) have used an extended version of this model in a series of applications from surveillance data. In particular, they showed that maximum likelihood estimation is straightforward and extended the model to the space-time domain using a multivariate branching process formulation. However, the application of this class of model to highly infectious diseases such as influenza is perhaps not suitable, due to the underlying assumption of stationarity.
In the next section, we will start with the classical SIR model and then describe the approach by Hufnagel et al. (2004) to model the spatio-temporal spread of infectious diseases. In Section 3, we develop an algorithm for simulating this model. A central feature of the formulation is that the dispersal of infected cases in space is not necessarily solely local but also global, if necessary. For example, infected cases might travel through air traffic large distances in a small amount of time. Based on data on air and train traffic in Germany, we define such a dispersal rate matrix for administrative regions in Germany and investigate whether simulations from such a model show similar patterns as an influenza epidemic in Germany in 2005. Model parameters are chosen based on external knowledge.
2 From the Standard Deterministic to a Global Stochastic SIR Model

Standard SIR Model
In the SIR model, we divide a population into three categories: Those who are susceptible to the disease (S), those who are infected and infectious (I), and those who are removed from the system because they are recovered and immune, or quarantined, or dead (R). With s, j, and r we denote the fractions of susceptible, infectious, and removed individuals of the total population N . Transitions from one category to another happen according to where α is the rate of an individual's contacts per day which are sufficient to spread the disease, and β −1 is the average infectious period. The infection dynamics in the standard deterministic SIR model is given by the set of differential equations Hence, while recovery follows a linear process, infections occur on high rate only when both the numbers of susceptibles and infectives are sufficiently large. Since we assume a closed population, i.e. ignoring births, non-related deaths, and migration during the relatively short duration of an influenza epidemic, we expect the size of the population to be constant. The fraction of recovered individuals thus reads r = 1 − s − j. The ratio ρ = α/β is called the basic reproduction number and states a decisive parameter for the course of the epidemic: When ρ −1 is greater than the initial fraction of susceptibles s 0 , no epidemic will develop. Otherwise, the epidemic will fall off as soon as the decreasing function s(t) drops below ρ −1 . In case of influenza, an infected individual acquires immunity to the strain he was affected by and can hence not become susceptible during the same wave of flu again. Therefore, there is no need for a transition from state R back to S. However, there are steadily new antigen mutants of the influenza virus coming up, which is why at the beginning of the next epidemic the whole population will be susceptible again.

Stochastic SIR Model
Bearing in mind that the infection and recovery processes are of rather stochastic than deterministic character, we write (1) in terms of stochastic Langevin equations: Austrian Journal of Statistics, Vol. 35 (2006), No. 1, 5-20 where ξ 1 (t) and ξ 2 (t) are independent Gaussian white noise forces, modelling fluctuations in transmission and recovery matters. These are of particular importance during the initial phase when the number of infected individuals is relatively small. The above equations can be derived as a Gaussian approximation to the general stochastic epidemic model (see e.g. Daley andGani, 1999, Section 3.3, andAndersson andBritton, 2000, Section 5.5) in which the total population size tends to infinity.

Excursus: SLIR Model
It is possible to also incorporate a latent status in our considerations, which yields the following transitions, the so-called SLIR model: where ε −1 is the average latent period. Let l denote the fraction of latent individuals of the total population. The differential equations then read in the deterministic case and in the stochastic model, where ξ 3 (t) accounts for noise in the duration of the latent period. Since our objective is the modelling of the spread of influenza, where an individual can normally pass on the virus from the moment of infection, we from now on suppress the consideration of latency. Nevertheless, the following observations can easily be adjusted to the SLIR model (cf. http://www.statistik.lmu.de/˜dargatz/publications).

Global SIR Model
So far, our model describes the spread of a disease in a single closed population under the assumption of homogeneous mixing. But this condition applies only as long as individuals cover relatively short distances-an assumption that is not given in our fully connected world anymore, even if we restrict our focus to a comparatively small area like Germany. As suggested in Hufnagel et al. (2004), we introduce a network of subregions 1, . . . , n of the primarily observed area, each region i having a population size N i being composed of S i , I i , and R i susceptible, infectious and removed individuals. Whilst the local infection dynamics within a subregion is given by the stochastic SIR model as introduced above, the global dispersal between the knots of the network is rated in a connectivity matrix γ = (γ ij ) ij : C. Dargatz et al.

9
The system of stochastic differential equations now changes to , and ξ 5 (t) are independent vector-valued white noise forces which stand for fluctuations in transmission, recovery, and outbound and inbound traffic, respectively.
Since in the global model the single populations are not closed anymore due to migration, the property s i + j i + r i = 1, i = 1, . . . , n, does not necessarily hold. Instead, s i , j i , and r i indicate the fractions of susceptible, infectious and removed individuals as measured by the original population N i . That is why in (2) we also declared the formula for r i .

Keeping the System Closed
Let us focus on the Gaussian white noises ξ j . The components of ξ 1 (t) and ξ 2 (t) (and also of ξ 3 (t)) are all stochastically independent of each other, but we have to introduce a weak form of dependence to the components of ξ 4 (t) and ξ 5 (t) due to the following: Since we assume the area of our n regions to be closed, we have to require The left hand side of this equation reads Obviously, the two sums over i in (3) both equal 0. In order to also let rows (4) and (5) disappear, we correlate the components of ξ 4 (t) and those of ξ 5 (t) among each other such that equality with zero holds almost surely.
For the components of ξ 4 , we proceed as follows (see Knorr-Held, 2000b): Define We hence seek Define the n × n-matrices where I n ∈ R n×n denotes the identity matrix and 1 n = (1, . . . , 1) ∈ R n×1 . Furthermore, and i.e. Q(t) is the covariance matrix of u(t). Then, as required, yielding (6). Unfortunately, the desired property i,j q ij (t) = 0 yields the drawback that Q(t) is not positive definite and hence unsuitable as covariance matrix. Instead of u(t), we hence consider a linear transformation Lu(t) with Draw π(t) = (π 1 (t), . . . , π n−1 (t), 0) with and retransform u(t) = (u 1 (t), . . . , u n (t)) = Mπ(t). We obtain Note that, for any i, we have x i (t) > 0 as long as r i (t) < 1, since for all i ∈ {1, . . . , n} there is a k ∈ {1, . . . , n} with γ ik > 0 (i.e. each district is directly connected to at least one other). However, if x i (t) = 0, the value of ξ (i) 4 (t) does not matter since in (4) it will be multiplied by x i (t).
Obtain ξ 5 in the same way as ξ 4 , replacing x i (t) by

Numerical Scheme
Given initial conditions s i (0), j i (0), and r i (0), i = 1, . . . , n, as well as fixed values for the transmission rate α and the reciprocal average infectious period β, we simulate the epidemic process at discrete, equidistant instants in the time domain [0, t max ]. Define functions a p and b pk , p ∈ {s, j, r}, 1 ≤ k ≤ 5, such that the system of SDEs (2) becomes Austrian Journal of Statistics, Vol. 35 (2006), No. 1, 5-20 for i = 1, . . . , n. Let δ be the (suitably small) time step. For the approximation of the differential equations (8), we apply the Euler-Maruyama approximation scheme for m ≥ 1 and i = 1, . . . , n, where t m = mδ and ξ Kloeden and Platen, 1999).
6. Evaluate ξ 1 , ξ 2 ∼ N (0, I n ) and ξ 4 , ξ 5 with ξ (i) With this transformation, we constantly adjust the fractions of susceptible, infected, and recovered individuals to the current population size of the respective region.

Dataset
The underlying data about incidences of influenza in Germany is taken from the Robert Koch Institute (RKI): SurvStat, http://www3.rki.de/SurvStat, deadline: 8 July 2005. We only consider cases categorized as A or A/B (i.e. no further differentiation), since it is the influenza A virus that is most responsible for national epidemics of the flu. Unfortunately, the data suffers from underreporting. According to estimations of the Federal Ministry of Health and Women, Austria (http://www.bmgf.gv.at), and the Robert Koch Institute (http://www.rki.de), the annual number of influenza cases is approximately 4.5% of the total population. However, only one out of 500 of these cases is reported to the RKI. Moreover, the number of announced cases depends on the number of medical examinations induced and does hence not reflect the actual geographical distribution. In particular, affections will be more clustered in the dataset than in reality.

Connectivity Matrix
The connectivity matrix γ describes the strength of traffic between the subunits of Germany. For its design we take into account the dispersal between adjacent regions, caused e.g. by commuters, and the domestic train and air traffic. Each of these three components is provided with a weight regulating its influence. At district level, we assume that the major part of the traffic between regions arises from commuters. Data from the Federal Statistical Office (http://www.destatis.de/ e home.htm) about the lengths of ways to work lead us to the assumption that about 30% of the employees work in a different district than their home town. Adding private traffic, we obtain an estimated fraction of 16% of the total population that is migrating between districts every day, which is reflected by γ having an average row total of 0.16. We choose the weights of the train and of the flight network to be 1/20 and 1/80 of the traffic between neighbored districts according to the annual amounts of travellers, which are about 200 million in the inter urban rail services and 50 million in the domestic flight connections. Certainly, these weights depend on the kind of disease and time period under observation. For example, the influence of the flight network will be less when considering children's diseases, and during school terms an increasing national mixing rate should be considered.
Within the matrix γ, the strength of migration between two adjacent districts is measured by their densities and numbers of surrounding districts. Our rail network model consists of 57 cities which are served by ICE trains. Data about flight connections was obtained from the OAGflights database (http://www.oagflights.com) and composed as in Hufnagel et al. (2004).
For counties and states, we assume the migration between parts of Germany to be more uniform than in the case of districts. For more details, see the supporting material.

Transmission Rate and Infectious Period
Before being able to run the simulation, we need an estimate of the parameters α and β. Recall that α is the daily number of contacts sufficient for infection an individual has with other individuals, and β −1 is the average infectious period of the disease. Due to these meanings, it is easy to estimate β, but more complicated to guess α. We hence try to estimate the basic reproduction number ρ = α/β. For that, we return to the standard deterministic SIR: Divide the second equation of (1) by the first one and obtain the timeindependent differential equation which has the explicit solution with a constant c. At the very beginning of an influenza epidemic, almost all individuals of the considered population are susceptible, whilst the number of infected should be about zero. With these assumptions, i.e. s 0 = 1 and i 0 = 0, we obtain c = 1. Consequently, Certainly, the term on the right is not constant for the available data. Moreover, as time goes by and safety measures like vaccination or isolation are increased, the reproduction number is going to fall. However, we assume ρ to be constant in time, but varying in space. From the application of formula (10) to our district-level data and lim t→∞ s(t) = 0.955 (compare Section 4.1) and lim t→∞ j(t) = 0, we set where d i is the population density of region i. This relation reflects the intuitively clear fact that the disease is more likely to spread in areas with high population densities. Since the infectious period of influenza usually lasts for four to five days, we assume β = 2/9 and calculate α i via βρ(d i ), i = 1, . . . , n.

Simulation Results
In this section we want to present the results of our simulations. We run the program for different starting scenarios for both the deterministic and stochastic model and try the effects of the parameters on the outcomes. Although we draw comparisons between the (highly under-)reported and the simulated data, we want to emphasize that the objective of this paper is neither to predict the future nor to exactly repeat former data, but to give an idea of the spatio-temporal spread of influenza and the effect of stochastic fluctuations on its outbreak. Results of the simulations are returned as animated maps of Germany, which are available at http://www.statistik.lmu.de/˜dargatz/publications.

Long-Term Simulations
We repeatedly run our simulation at district level with α and β as estimated in Section 4.3 and with an initial number of infectious individuals according to week 5/2005 in our dataset (Section 4.1). Though being probabilistic, the simulations generally yield the same pattern (see also Figure 1): From South Germany, where an increased level of prevalence was observed in week 5, the disease bounces to Bremen and at the same time moves via Frankfurt to North Rhine-Westphalia and Lower Saxony, from where it spreads to the Eastern part of Germany and finally affects the whole nation. This shows surprisingly good agreement with the actual course of the influenza epidemic in 2005 as demonstrated at http://influenza.rki.de. In the last graphic of Figure 1, we interpret the increased morbidity at the national borders, especially in North and East Germany, as edge effects.
As mentioned in Section 4.1, cases in our dataset appear more concentrated in one region than they probably are, which might be due to different reporting behavior. In contrast to that, our simulation does not leave any district unaffected. The final size of the epidemic, which is the fraction of individuals that have been affected by the disease 16 Austrian Journal of Statistics, Vol. 35 (2006), No. 1, 5-20  at the end of the epidemic, is about 4.5% on average. Figure 2 shows the simulated final sizes of the epidemic in each district after 100 iterations. The duration of the outbreak in our simulations is about 150 days, which is twice as long as the actual continuance of the influenza wave from week 5/2005 on. We see the reason for this in a slow onset of the epidemic-caused by too small numbers of initially infectious individuals-and the reproduction number ρ not decreasing in time but being constant, which contradicts the real situation (see Section 4.3). Investigations show that the amount of initially infected individuals hardly affects the final size of the epidemic-as long as the fraction of susceptibles at the very beginning does not fall below ρ −1 -but rather shifts the starting point of the major outbreak. To our surprise, changes in the time step δ do not really matter regarding the speed, course and intensity of the spread, which means that our numerical scheme already yields good results for relatively large δ.

One Week's Forecasts
We initialize the computer program with data from various weeks of our dataset and simulate the following week's spread of the epidemic repeatedly. Figure 3 compares the distributions of the proportions of infected individuals of the total population in week 7/2005 for 2000 simulations of the stochastic model with the respective deterministic result and actual data for the three considered divisions of Germany and δ ∈ { 0.1, 1} (measured in days). It turns out that the deterministic outcomes are similar for all resolutions and both values of δ, but do not agree with the dataset, which is not surprising due to the high level of underreporting mentioned in Section 4.1. On county and state level, the stochastic results seem to be normally distributed with the deterministic value as mean, where the variance is smaller for δ = 0.1 than for δ = 1. In contrast to that, the probabilistic modelling on district level yields results that are larger than in the deterministic case (for δ = 1 much more clearly than for δ = 0.1), though apparently also normally distributed. We suspect the reason for this offset in the starting distribution of infectious individuals: While there are reported affections in almost all counties and states, prevalences are concentrated on relatively few districts. When in our simulation the epidemic spreads to those districts with the initial fraction of infectives being zero, the stochastic fluctuations in this dynamics are kind of bounded to one side (compare with step 9 of the algorithm in Section 3). Obviously, this effect is deeper for larger values of δ. If we focus on those few districts where morbidity was already present at the beginning of the simulation, we obtain rather satisfying results already for δ = 1 (see Figure 4). For these districts, the actual data lies within the range of the stochastic results. We conclude that the stochastic simulation at district level is rather inappropriate as long as we consider relatively short terms or cannot improve the quality of the underlying data.

Conclusion and Outlook
In this paper, we presented a global extension of the classical SIR model as well as technical details for its implementation and initialization. Computer simulations provided quite realistic demonstrations of the spread of diseases in Germany. The model assumes that some percentage of susceptibles and infectives of one region move to another region and become part of the population in the other region. Since most trips considered here are day trips, a possible alternative model would be to keep the populations in each region fixed and to assume that susceptibles and infectives have contacts between regions.
In future work, we will further refine the model both by considering this modification and e.g. by involving time-dependent parameters (cf. Sections 4.3 and 5.1). Furthermore, we intend to deal with the question of finding surveillance strategies in case of a sudden outbreak of an epidemic, like specific isolation, vaccination or observation of migration. One main purpose of our research will certainly involve the application of more formal statistical inference techniques for estimating the model parameters based on available data from surveillance databases.