Hierarchical Models for Mitochondrial DNA Sequence Data

We introduce a Bayesian hierarchical model for mitochondrial DNA sequence data, which is fitted via acceptance-rejection algorithms. The model incorporates parametric models of population history explicitly as well as a mutational process allowing for a simultaneous parameter estimation whose importance has become increasingly clear in many recent studies. The model is applied to a sample of DNA sequences from the Italian population. Abstract: Wir stellen ein Bayes’sches hierarchisches Modell für mitochondrialen DNA Sequenzdaten vor, das mittels Acceptance-Rejection Algorithmen angepasst wird. Das Modell enthält explizit parametrische Modelle für die Entwicklung der Population wie auch einen Veränderungsprozess und erlauben eine simultane Parameterschätzung. Die Wichtigkeit dieser Vorgehensweise ist durch viele aktuelle Studien ganz deutlich nachgewiesen worden. Das Model wird auf eine Stichprobe von DNA Sequenzen der Italienischen Population angewandt. Wir stellen ein Bayes’sches hierarchisches Modell für mitochondrialen DNA Sequenzdaten vor, das mittels Acceptance-Rejection Algorithmen angepasst wird. Das Modell enthält explizit parametrische Modelle für die Entwicklung der Population wie auch einen Veränderungsprozess und erlauben eine simultane Parameterschätzung. Die Wichtigkeit dieser Vorgehensweise ist durch viele aktuelle Studien ganz deutlich nachgewiesen worden. Das Model wird auf eine Stichprobe von DNA Sequenzen der Italienischen Population angewandt.


Introduction
Inference about population histories and evolutionary processes are not only of intrinsic interest but are also crucial to the interpretation of genetic data in a wide range of applications which vary from molecular biology and genetic medicine to forensic sciences (see Bataille et al., 1999;Foreman et al., 1997;Jorde et al., 2001).
The control region (sometimes referred as D-loop) of the mitochondrial genome is widely used in studies of human population to address question concerning genetic variation within species.This is due to the maternal inheritance of mitochondrial DNA, the absence of recombination and the high mutation rate (Cann et al., 1987).
Human control region sequences evolve according to a complex pattern that makes analysis difficult.In fact underlying a sample of DNA sequences data is a structure shaped by dependencies that reflect the ancestral relationships between the sequences and are affected by historical patterns of migration, mating behavior and population growth, as well as mutation and selection (Cavalli-Sforza et al., 1994).In the absence of recombination, these relationships can be represented by a genealogical tree for which each leaf corresponds to a sequence at the present time while the root of the tree represents the most recent common ancestor of all the sequences in the sample (Wilson et al. (2003)).
Although the underlying relationships are crucial in modelling the dependence structure of a sample of DNA sequences, they are in fact ignored by traditional methods, most of which are based on the distribution of pairwise differences, obtained by comparing pairs of sequences and recording the number of pairs with 0, 1, . . .differences, or summary statistics like the number of segregating sites, i.e. the number of single positions or loci in a sequence which experienced a mutation (see Bonneuill, 1998;Tavaré et al., 1997).
In recent years important advances have been made in developing tree reconstruction methods and computational techniques such as Markov chain Monte Carlo and importance sampling (Stephens, 2001).Tree reconstruction methods can give insights into the mode of evolution of the genomic region studied.In particular, coalescent theory provides a framework with which to incorporate parametric models of population history explicitly.Coalescent models, which describe the evolution of a sample of DNA sequences in terms of stochastic processes, allow application of statistical techniques for parameter estimation and model testing (Stephens and Donnelly, 2000).
The trade-off is that the computational complexity of the analysis can increase substantially and implementing these models and algorithms is challenging.Furthermore they assume a constant rate at which mutations occur while an important feature of mitochondrial sequence evolution is the variation of rates among sites.To gauge the contribution of hot spot, i.e. positions at which substitutions accumulate predominantly, to the high rate estimate, it is necessary to infer the rate for each polymorphic locus in the sequence (see Wakeley, 1994;Swofford et al., 1995).
The aim of this paper is to introduce a Bayesian hierarchical model to estimate demographic and mutational parameters of a sample of mitochondrial DNA sequences.In order to take into account the underlying ancestral relationships between sequences we use coalescent models as prior distributions.
In the next section we give a brief outline of the standard coalescent process and the coalescent with population growth and then we introduce the hierarchical model.In Section 3 we discuss the choice of the prior distributions along with the results obtained by the application of the model to a sample from Italian population.

Standard Coalescent Process
The coalescent is a stochastic model for the genealogical tree representing the ancestral relationships between a sample of DNA sequences.It approximates the distribution of genealogical trees under an important class of neutral population genetics models, including the celebrated Wright-Fisher model of a random-mating population of constant size N (see Hudson, 1990;Kingman, 1982).To recover this approximation, 1 unit of 'coalescent' must be interpreted as N generations, where N is the effective population size, say the number of adults in a population contributing offspring to the next generation (Hartl and Clark, 1997).
Coalescences occur only between pairs of individuals.This process may be thought of as generating a binary tree, with leaves representing the sample sequences and vertices where ancestral lines coalesce.
Coalescent time runs backward, with time t 0 ≡ 0 denoting the present and time t k , k ∈ {1, 2, . . ., n − 1}, denoting the time of the k-th most recent coalescent event which take place between n individuals.In particular, t n−1 denotes the time of the most recent common ancestor (TMRCA) of the sample.
The between-coalescent intervals W k = t n−k+1 − t n−k during which the sample has k distinct ancestors have independent exponential distributions for t > t .An important quantity associated to a coalescent tree is the height, defined as By definition the expectations of W k are equal to 2/k(k − 1) and so the expectation of T is given by E(T ) = 2 1 − 1 n which approaches 2 units of coalescent time, equivalent to 2N generations as the sample size gets large.

Coalescent with Population Growth
The effect of variable population size is to change the joint distribution of the times t k .Suppose that the population size at the time of sampling is N .The population size at time N t 0 λ(s)ds generations ago will be written as N λ(t).The standard coalescent model is a special case with λ(s) ≡ 1 of the model in which equation ( 1) is replaced by where Λ(t) ≡ t 0 ds/λ(s) (Hudson, 1990).Let us consider pure exponential growth at rate R, i.e.
A more realistic scenario is a two parameter model for which corresponding to a population of constant size N until N t g generations ago, after which it grows at rate R per generation to reach its current size N c , where Under this model the coalescent time distribution (2) becomes

A Hierarchical Model
Mitochondrial DNA sequences data are a string of letters each of ones denotes the allele, i.e. the possible state of the DNA sequence at a locus.Letter A stay for adenine, C for cytosine, G for guanine, and T for thymine.However, it is a common situation that a locus exhibits just two variants across individuals, for example T ⇔ C or A ⇔ G. Then for each locus we arbitrarily choose and fix one of the two variants as usual in singlenucleotide polymorphism allele frequencies models (Nicholson et al., 2002).We consider the setting in which we have a collection of L loci from a modern population.Suppose n j is the number of individuals typed at the j-th locus in the population.At the lowest levels of the hierarchical model, the number of copies x j of the chosen variant at locus j has a binomial distribution where α j ∈ [0, 1] is the probability of the chosen variant at locus j.
We model the dependence structure between the modern population and the ancestral population to that sampled through a mutational and a demographic process.First introduce a collection of unobserved quantities, one for each locus: π j , j = 1, 2, . . ., L which plays the role of the allele frequencies in the ancestral population.Then is the probability of a mutation after tN generations (see Weir, 1990), whereas is the probability of changes between the two variants in the j-th locus.Finally to complete the hierarchy we put independent priors on π, µ, N, t.The choice of the prior distribution is discussed in the results section.

Simulation Methods
In this section we describe a simulation approach which is based on the acceptancerejection method (see Ripley, 1987) to generate observations from the posterior distribution of demographic and mutational parameters.We assume that, before observing the data, N and µ are mutually independent random quantities and independent of coalescent times t and the ancestral population allele frequencies π.The prior distributions of N and µ should be chosen so as to summarize P. Berchialla 57 the information available, for example from relevant genetic and anthropological studies.Typically, such information will not uniquely specify the distribution, therefore it is prudent to consider several different plausible choice and investigate the sensitivity of conclusions.Algorithm 1 1. simulate N and µ from their specified prior distribution; 2. simulate π j and T from a symmetric Beta(p) and an Exponential(2) distribution, respectively; 3. calculate α j according to the equation ( 5); 4. keep (π j , T, N, µ) with probability u defined by When considering the coalescence time T the algorithm can be modified as follows Algorithm 2 1. simulate N and µ from their specified prior distribution; 2. simulate π j from a symmetric Beta(p) distribution and the W k as independent Exponential(k(k − 1)/2) variables, k = 1, 2, . . ., n; 3. evaluate T by T = n k=2 W k ; 4. calculate α j according to equation (5); 5. keep (π j , T, N, µ) with probability u defined by (6).
When allowing for population growth, the above algorithm can be readily modified in the following one.
Finally the preceding algorithm can be modified as follows to allow for the demographic scenario provided by the two-parameter exponential growth.
Austrian Journal of Statistics, Vol. 36 (2007), No. 1, 53-64 Algorithm 4 1. simulate R and then simulate time t g which corresponds to the time at which the population starts its exponential growth and N by and evaluate T = n k=2 W k = t n−1 ; 3. simulate µ and π from their prior distributions; 4. keep (π j , T, N, N c , R, µ) with probability u defined by (6).
All the simulation routines were written using the R programming language (R Development Core Team, 2006).

Results
A data set consisting of mitochondrial DNA sequences from a sample of 49 Italian individuals was collected to illustrate the hierarchical model described above.Data are available on the FBI Mitochondrial DNA Population Database (Monson et al., 2002).Each sequence is composed of the first 360 base pair segment of the control region, corresponding to positions 16024-16383 in the human reference sequence of Anderson et al. (1981).
In the mtDNA sequences, 28 polymorphic loci which showed variation across individuals were identified.The settings of loci was considered evolving independently.Such condition typically arises unless loci are very close on the same molecule.
For setting up prior distribution in the standard coalescent model with a constant population size, we refer to Tavaré et al. (1997) They estimate the effective population size N to be of order 5000 individuals.To allow uncertainty a gamma prior distribution with mean 5000 and shape 5 was considered (Table 1).Such distribution concentrates largely on values between 0 and 6000 and it is approximately centered around the value 4900 which is also supported by Hammer (1995) and Fullerton et al. (1994).Alternatively a log-normal distribution with mean and standard deviation equal to 9 and 1 respectively was considered (see Wilson et al., 2003).Since the resulting posterior distributions were similar, results were summarized only for the posterior distribution obtained by assuming the gamma prior in Table 2.
In the coalescent model with exponential population growth, uncertainty of growth parameter R was modelled by means of a gamma distribution with mean 0.005, shape 2 and rate 400.According to Wilson et al. (2003) it was reasonable to assume an effective population size N distributed as a gamma with mean 3000 and shape and rate parameters equal to 3 and 10 −3 respectively, while for the ancestral population size N c , parameter log(N c /N ) was modelled as a gamma distribution with shape and rate equal to 5 and 1, respectively.From the relationship 7 along with the assumption that log(N c /N ), N , and R were mutually independent (see Wilson et al., 2003) the distribution of time t g at which population started its growth was derived.According to considerations as in Tavaré et al. (1997), uncertainty of the mutation rate µ j was given in the form of a gamma ∼ (2, 4.94 × 10 −8 ) for all j (see Tables 3 and 5).A value of 25 years was assumed for the generation time (Hein, 2004).Finally a prior distribution on allele frequencies in the ancestral population was taken to be a symmetric beta and alternatively a uniform distribution (Nicholson et al., 2002).Sensitivity analysis showed that conclusions did not depend on the choice of the prior distribution, thus results in Table 7 were summarized for the choice of a symmetric beta with parameter p = 0.1, only.
In Table 2, the estimated TMRCA using algorithm 2 was two order of magnitude higher than TMRCA estimated via algorithm 1.This is mainly due to the effect of mutations which stretch out the intervals between coalescence times.
From Table 7, positions 16146-16173-16190-16194-16257 showed a mean mutational rate which was lesser than other positions both for algorithm 3 and algorithm 4 revealing a mutation rate heterogeneity of the mitochondrial DNA control region.
To test the ability of the inference procedure introduced and recover accurate estimates of parameters a simulation study was undertook.We simulated 50 data sets from algorithm 2. The parameters underlying each simulation were obtained via the independent prior distributions discussed above.
In Table 8 let χ be the indicator function which assume value equal 1 if the 100p% posterior interval includes the correct value, 0 otherwise.The observed average of χ over the data sets formed a Binomial(50, p) proportion.Since there were not appreciable differences results were summarized for TMRCA, effective population size N and just for one mutational and one ancestral allele frequencies parameter.The number of data sets for which 100p% posterior interval included the true parameter value was slightly lesser than the mean, however the differences did not exceed three standard deviation.Ancestral population allele frequencies π j were the parameters with the greatest differences among achieved and nominal coverage.

Discussion
We fitted a hierarchical model for mitochondrial DNA sequences data by using a fully Bayesian approach implemented via acceptance-rejection algorithms.A feature of these algorithms is that they are usually efficient in term of time consuming.Otherwise, when the acceptance probability is very small the algorithm can be computational expensive and alternatively the model may be fitted via Markov chain Monte Carlo methods.How-ever, the method described allows much more flexibility to explore the effects of different modelling assumptions.It is straightforward to adapt the algorithms given to include other demographic scenarios such as coalescent with population splitting (see Wilson et al., 2003).Furthermore the model itself allows for different sample size across loci.Several authors (see Tavaré et al., 1997;Wilson and Balding, 1998) have implemented Bayesian methods in order to make inferences on the mutation rate µ and the effective population size N separately while, in the absence of other information, the statistical features of the sequence data are affected by the value of the compound mutation parameter 2N µ (Yang, 1996).On the other hand demographic parameter estimates along with site-specific rate estimates could be used to refine models of molecular structure allowing for a better understanding of the forces and mechanisms that affect sequence evolution, Wakeley (1994).The proposed hierarchical model is thus motivated by the importance of simultaneous parameter estimation, especially whether mutational rate heterogeneity exists among sites.In fact, it allows to get an estimation of the substitution rate at the nucleotide level pointing out those sites which are more likely to experiencing a mutation.
Results obtained from the implementation of the model are consistent with other studies.In fact, the estimated value of the effective population size in the standard coalescent model (Table 2) is about 5000 which is consistent with the results of Jorde et al. (2001).On the other hand TMRCA estimated using algorithms 3 and 4, see Tables 4 and 6, is consistent with the coalescence time estimated in the European population using the last intron of the ZFX gene (Jaruzelska et al., 1999).
Recent studies (see Denver et al., 2000;Heyer et al., 2001) reveal a mitochondrial substitution rate that is two orders of magnitude higher than previous indirect estimates.This is mainly due to multiple mutations affecting coding function as well as mutational hotspots.Denver et al. (2000) gives an overall mutation rate equal to 1.6 × 10 −7 per site per generation (±3.1 × 10−8) which support results summarized in Table 7, except for position 16173 which is a slow site accordingly to Heyer et al. (2001).
An interesting generalization of this approach relies on the fact that it naturally handles missing data and could be extended to incorporate correlations between loci.
With regard to the simulation study, from Table 8 should be observed that ancestral population allele frequencies π j are those parameters with the greatest differences among achieved and nominal coverage.This is probably due to the fact we simulated them using a symmetric beta distribution which assign the same probability to rare and common variant.Anyway this seem unlikely to have a large effect on inferences since for most parameters of interest there is a good agreement between the predicted and the actual coverage.
A limitation of this work concerns the sampling strategy.In our analysis we supposed that the sample was indeed "random".In practice this is difficult to arrange and the sensitivity of estimates and inferences to non-random sampling should be quantified.

Table 1 :
Demographic parameters from prior distribution for algorithms 1 and 2

Table 2 :
Demographic parameters from posterior distribution for algorithms 1 and 2

Table 3 :
Demographic parameters from prior distribution for algorithm 3

Table 4 :
Demographic parameters from posterior distribution for algorithm 3

Table 5 :
Demographic parameters from prior distribution for algorithm 4

Table 6 :
Demographic parameters from posterior distribution for algorithm 4

Table 7 :
Mutational parameters from posterior distribution for algorithms 3 and 4

Table 8 :
Simulation study consisting of 50 data sets from the algorithm 2