Markov Chain of Conditional Order: Properties and Statistical Analysis

The paper deals with the finite Markov chain of conditional order, a special case of the high-order Markov chain with a small number of parameters. Statistical estimators for the parameters and statistical tests for parametric hypotheses are constructed, and their properties are analyzed. Results of computer experiments on simulated and real data are presented.


Introduction
The finite Markov chain of order $s$ ($1 \le s < \infty$) described by Doob (1953) is a well-known universal mathematical model for the analysis of long-memory discrete-valued time series in many applied fields. It is used for statistical data analysis in genetics (see Waterman 1999), economics (see Ching 2004), signal processing (see Li, Dong, Zhang, Zhao, Shi, and Zhao 2010) and other areas.
Unfortunately, this model has a significant disadvantage. It has exponential complexity, since the number of independent parameters $D(s)$ of the $N$-state Markov chain of order $s$ increases exponentially w.r.t. $s$: $D(s) = N^s (N - 1)$. Because of this "curse of dimensionality", identifying the model requires a time series of large size (length) $n \ge D(s)$, which is not available in practice (see Kharin 2013; Kharin 2005; Kharin and Shlyk 2009). Therefore, small-parametric, or parsimonious, models are developed to overcome this difficulty. These models are special cases of the $s$-order Markov chain, but the number of parameters required to determine the one-step transition probability matrix is much smaller than $D(s)$. Let us give some examples of such parsimonious models: the Markov chain of order $s$ with $r$ partial connections (see Kharin and Petlitskii 2007), the Raftery model (see Raftery 1985), and the variable length Markov chain (see Buhlmann and Wyner 1999). For example, the conditional probability distribution of the current state of the Markov chain of order $s$ with $r$ partial connections depends not on all $s$ previous states, but only on $r$ selected states. This paper is devoted to a new parsimonious model called the Markov chain of conditional order, proposed by the authors in Kharin and Maltsew (2012).
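To make the exponential growth concrete, the following minimal sketch evaluates the standard parameter count of a fully-connected chain, $N^s (N - 1)$ (one free probability vector of $N - 1$ entries for each of the $N^s$ histories). The function name is hypothetical and is used only for illustration.

```python
def full_order_param_count(n_states: int, order: int) -> int:
    """Independent parameters of a fully-connected Markov chain: N^s * (N - 1)."""
    return n_states ** order * (n_states - 1)


# Growth of the parameter count for a binary and a four-letter alphabet.
for s in (1, 2, 5, 10):
    print(s, full_order_param_count(2, s), full_order_param_count(4, s))
```

Already for a binary chain of order 10 the count reaches 1024, which is why a time series of comparable length is needed to identify the full model.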

Mathematical model
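First let us introduce the notation: $\mathbb{N}$ is the set of positive integers; $N \in \mathbb{N}$, $2 \le N < \infty$; $A = \{0, 1, \ldots, N-1\}$ is the finite state space with $N$ elements; $J_n^m = (j_n, \ldots, j_m) \in A^{m-n+1}$, $m \ge n$, is the multiindex (a subsequence of indices from a sequence $j_1, j_2, \ldots$); $\langle J_n^m \rangle$ is the numeric representation of the multiindex $J_n^m$; $\{x_t \in A : t \in \mathbb{N}\}$ is a homogeneous Markov chain of order $s$ ($2 \le s < \infty$) with the $(s+1)$-dimensional matrix of transition probabilities $P = (p_{J_1^{s+1}})$:

$$p_{J_1^{s+1}} = \mathrm{P}\{x_{t+s} = j_{s+1} \mid x_{t+s-1} = j_s, \ldots, x_t = j_1\}, \quad J_1^{s+1} \in A^{s+1},\ t \in \mathbb{N};$$

$L \in \{1, \ldots, s-1\}$, $K = N^L - 1$; $Q^{(1)}, \ldots, Q^{(M)}$ are $M$ ($1 \le M \le K + 1$) different square stochastic matrices of order $N$: $Q^{(m)} = (q^{(m)}_{i,j})$, $i, j \in A$, $1 \le m \le M$.

The Markov chain $\{x_t\}$ is called the Markov chain of conditional order (Kharin and Maltsew 2012), if its one-step transition probabilities have the following parsimonious form:

$$p_{J_1^{s+1}} = \sum_{k=0}^{K} \mathbf{1}\{\langle J_{s-L+1}^{s} \rangle = k\}\, q^{(m_k)}_{j_{b_k},\, j_{s+1}}, \qquad (1)$$

where $1 \le m_k \le M$, $1 \le b_k \le s - L$, $0 \le k \le K$, and all elements of the set $\{1, 2, \ldots, M\}$ occur in the sequence $m_0, \ldots, m_K$. The sequence of elements $J_{s-L+1}^{s}$ is called the base memory fragment (BMF) of the random sequence, $L$ is the length of the BMF; the value $s_k = s - b_k + 1$ is called the conditional order. Thus the conditional probability distribution of the state $x_t$ at time $t$ depends not on all $s$ previous states, but only on the $L + 1$ selected states $(j_{b_k}, J_{s-L+1}^{s})$. If $L = s - 1$ and $s_0 = s_1 = \cdots = s_K = s$, we have the fully-connected Markov chain of order $s$. If $M = K + 1$, then each transition matrix corresponds to only one value of the BMF; otherwise there exists a common matrix which corresponds to several values of the BMF.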
Therefore the Markov chain of conditional order is determined by the following parameters:
• the unconditional order $s$ of the Markov chain;
• the length $L$ of the BMF;
• $K + 1$ conditional orders $\{s_k : 0 \le k \le K\}$;
• $K + 1$ parameters $\{m_k : 0 \le k \le K\}$, which determine the transition matrices;
• $M$ stochastic matrices of order $N$, which are described by $MN(N - 1)$ independent parameters.
Hence the transition matrix $P = (p_{J_1^{s+1}})$, $J_1^{s+1} \in A^{s+1}$, of the Markov chain of conditional order is determined by

$$d = 2(N^L + 1) + MN(N - 1) \qquad (2)$$

independent parameters. For example, we need no more than 66 parameters for the Markov chain of conditional order if $s = 10$, $L = 2$, whereas the fully-connected Markov chain of this order requires $D(s) = 1024$ parameters.
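The model can be illustrated by a small simulator. This is only a sketch under stated assumptions: the BMF (the last $L$ states) is encoded as a base-$N$ integer $k$, which selects both the conditional order $s_k$ and the matrix $Q^{(m_k)}$; the next state is then drawn from the row of that matrix indexed by the state $s_k$ steps back. The base-$N$ encoding, the 0-based matrix indices, the uniformly random initial states, the randomly generated matrices and all function names are assumptions made for illustration. The orders $s_0 = 8$, $s_1 = 6$, $s_2 = 8$, $s_3 = 3$ with $N = 2$, $s = 8$, $L = 2$ are those used later in the simulated experiments.

```python
import random


def bmf_index(bmf, n_states):
    """Encode a base memory fragment as an integer in 0..N^L - 1 (assumed base-N encoding)."""
    k = 0
    for state in bmf:
        k = k * n_states + state
    return k


def simulate_mcco(n, n_states, s, L, cond_orders, matrix_ids, matrices, rng):
    """Simulate n states; cond_orders[k] = s_k, matrix_ids[k] = m_k (0-based here)."""
    # Assumption: the first s states are drawn uniformly at random.
    x = [rng.randrange(n_states) for _ in range(s)]
    while len(x) < n:
        k = bmf_index(x[-L:], n_states)      # value of the BMF
        far_state = x[-cond_orders[k]]       # the state s_k steps back
        row = matrices[matrix_ids[k]][far_state]
        x.append(rng.choices(range(n_states), weights=row)[0])
    return x


# Hypothetical example: four random 2x2 stochastic matrices, one per BMF value.
rng = random.Random(0)
matrices = []
for _ in range(4):
    rows = []
    for _ in range(2):
        p = rng.uniform(0.1, 0.9)
        rows.append([p, 1.0 - p])
    matrices.append(rows)

series = simulate_mcco(2000, 2, 8, 2, [8, 6, 8, 3], [0, 1, 2, 3], matrices, rng)
print(series[:20])
```

Each conditional order must lie between $L + 1$ and $s$, so that the selected state is outside the BMF and inside the memory window.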

Statistical estimators for parameters
In this section we present statistical estimators for the parameters of the Markov chain of conditional order. Introduce the notation: $X_1^n \in A^n$ is the observed time series of length $n$; $\pi^0$ is the initial probability distribution of the Markov chain of conditional order (1); $\nu^s_{l,y}(J_1^l)$ is the frequency of the state $J_1^l \in A^l$ with a time gap of length $y$ between the elements $j_1$ and $J_2^l$; $\nu_{s+1}(J_1^{s+1})$ is the frequency of the $(s+1)$-tuple $J_1^{s+1}$.
First, let us give ergodicity conditions for the Markov chain of conditional order.
Theorem 1. The Markov chain of conditional order is ergodic if and only if there exists a number $m \in \mathbb{N}$, $s \le m < \infty$, such that the following inequality holds:
Proof. Consider the first-order vector-valued Markov chain $X_t = (x_t, \ldots, x_{t+s-1}) \in A^s$ with the extended state space, as in Doob (1953), which is equivalent to the $s$-order Markov chain $\{x_t \in A : t \in \mathbb{N}\}$. The transition matrix for $X_t$ has the following form: According to Kemeny and Snell (1963), the Markov chain $X_t$ is ergodic if and only if there exists a number $m \in \mathbb{N}$ such that the following inequality holds: where $p^{(c)}$ is the $c$-step transition probability from $J_1^s$ to $J_{1+c}^{s+c}$ for the Markov chain $X_t$. Using properties of probability and definition (1), we come to the criterion (3). The theorem is proved.
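For small $N$ and $s$ this kind of criterion can be explored numerically. The sketch below assumes that the condition amounts to regularity of the embedded first-order chain on $s$-tuples in the sense of Kemeny and Snell, i.e. that some $m$-step transition matrix is strictly positive; it works only with the support of the transition probabilities. The function names and the callable `trans_prob(history, j)`, which returns the probability of the next state `j` after the `s`-tuple `history`, are hypothetical.

```python
from itertools import product


def one_step_support(n_states, s, trans_prob):
    """Successors of every s-tuple in the embedded first-order chain."""
    succ = {}
    for state in product(range(n_states), repeat=s):
        succ[state] = {state[1:] + (j,)
                       for j in range(n_states)
                       if trans_prob(state, j) > 0}
    return succ


def is_regular(succ):
    """True if some power of the embedded transition matrix is strictly positive."""
    n = len(succ)
    reach = {a: {a} for a in succ}  # states reachable in exactly 0 steps
    # Wielandt bound: a primitive n x n matrix has exponent at most (n - 1)^2 + 1.
    for _ in range((n - 1) ** 2 + 1):
        reach = {a: set().union(*(succ[b] for b in r)) for a, r in reach.items()}
        if all(len(r) == n for r in reach.values()):
            return True
    return False


# All transitions possible: regular. Deterministic copy of the oldest state: not regular.
print(is_regular(one_step_support(2, 2, lambda h, j: 0.5)))
print(is_regular(one_step_support(2, 2, lambda h, j: 1.0 if j == h[0] else 0.0)))
```

The state space of the embedded chain has $N^s$ elements, so this brute-force check is practical only for short memory.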
In the sequel we will consider ergodic Markov chains. It is known that the probability distribution of an ergodic Markov chain tends to a stationary probability distribution. The next theorem determines conditions under which the stationary distribution is uniform.
Theorem 2. If the Markov chain of conditional order is ergodic, then its stationary distribution is uniform if and only if the following equations hold ($k = 0, 1, \ldots, K$): (5)
Proof. As in the proof of Theorem 1, consider the first-order vector Markov chain $X_t$. It is known from Borovkov (1998b) that the stationary distribution for $X_t$ is uniform if and only if its transition matrix is a doubly stochastic matrix, that is Define $k = \langle J_{2s-L}^{2s-1} \rangle$ and transform (6) using (4) and (1): ) is a doubly stochastic matrix, and we have the second row in (5). If , and we have the first row in (5). The theorem is proved.
We will use the likelihood function to estimate the transition probability matrices $\{Q^{(m_k)}\}$ and the conditional orders $\{s_k\}$. In order to build it we have to find the $n$-dimensional probability distribution of the observed time series $X_1^n$ generated by the model (1).
Lemma 1. The $n$-dimensional probability distribution ($n > s$) for the Markov chain of conditional order (1) has the following form:
Proof. Using the theorem on compound probabilities and the Markov property we have: .
Corollary 1. The loglikelihood function for the Markov chain of conditional order (1) has the following form:
Now we can construct maximum likelihood estimators (MLEs) for the transition probabilities $\{Q^{(m_k)} : k = 0, \ldots, K\}$ and the conditional orders $\{s_k : k = 0, \ldots, K\}$.
Theorem 3. If the true values $s$, $L$, $\{s_k : k = 0, \ldots, K\}$ and $\{m_k : k = 0, \ldots, K\}$ are known, then the MLEs for the one-step transition probabilities $\{q^{(m_k)}_{j_0, j_{L+1}}\}$ are given by (9): where
Proof. In order to construct the MLEs we need to solve the following problem: This maximization problem splits into $N^{L+1}$ subproblems ($j_0 \in A$, $J_1^L \in A^L$): Solving these subproblems by the Lagrange multiplier method, we come to the estimators (9). The theorem is proved.
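The frequency-based form of such estimators can be sketched as follows. For every time point the BMF value $k$ selects the matrix index $m_k$ and the state $s_k$ steps back; the estimate of a matrix entry is the ratio of the joint count to the marginal count, with counts pooled over all BMF values that share the same matrix. This is a minimal illustration, not the paper's exact formula (9): the base-$N$ encoding of the BMF, the 0-based matrix indices and the fallback to the uniform row $1/N$ when a row was never observed are assumptions, and the function name is hypothetical.

```python
from collections import Counter


def estimate_matrices(x, n_states, s, L, cond_orders, matrix_ids):
    """Frequency estimates of the matrices; cond_orders[k] = s_k, matrix_ids[k] = m_k."""
    joint = Counter()   # (matrix, far state, next state) -> count
    margin = Counter()  # (matrix, far state) -> count
    for t in range(s, len(x)):
        k = 0
        for state in x[t - L:t]:            # assumed base-N encoding of the BMF
            k = k * n_states + state
        m = matrix_ids[k]
        far = x[t - cond_orders[k]]         # the state s_k steps back
        joint[m, far, x[t]] += 1
        margin[m, far] += 1
    estimates = []
    for m in range(max(matrix_ids) + 1):
        rows = []
        for i in range(n_states):
            if margin[m, i] > 0:
                rows.append([joint[m, i, j] / margin[m, i] for j in range(n_states)])
            else:
                # Assumption: unobserved rows default to the uniform distribution.
                rows.append([1.0 / n_states] * n_states)
        estimates.append(rows)
    return estimates


# Toy check on an alternating sequence, where x_t always equals x_{t-2}.
toy = [0, 1] * 50
print(estimate_matrices(toy, 2, 2, 1, [2, 2], [0, 1]))
```

Pooling the counts over BMF values with a common matrix is what distinguishes the general case from the case where every BMF value has its own matrix.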
In the rest of the paper we will assume that $M = K + 1$, i.e. $K + 1$ independent matrices correspond to $K + 1$ different values of the BMF, and $m_k = k + 1$, $k = 0, 1, \ldots, K$. In this case the estimators (9) have the following form: We will also use the following notation for transition probabilities and their estimators: Following Kharin and Maltsew (2011), we construct estimators for the conditional orders $\{s_k\}$.
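Theorem 4. If $s$ and $L$ are known, then the MLEs for the conditional orders $\{s_k : k = 0, \ldots, K\}$ are $\hat{s}_k = \arg\max$
In order to estimate the order $s$ and the BMF length $L$ we use the Bayesian information criterion (BIC) (see Csiszar and Shields 1999): the estimators $(\hat{s}, \hat{L})$ minimize over the admissible pairs $(s', L')$ the value

$$\mathrm{BIC}(s', L') = -2\, l + d \log n,$$

where $l$ is the loglikelihood function of Corollary 1 evaluated at the estimators constructed for the order $s'$ and the BMF length $L'$, $S_+ \ge 2$ and $1 \le L_+ \le S_+ - 1$ are the maximal admissible values of $s$ and $L$ respectively, and $d$ is the number of independent parameters of the model (1) defined by formula (2).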
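A hedged sketch of both steps is given below. For each BMF value $k$ the conditional order is chosen as the candidate $y \in \{L + 1, \ldots, s\}$ that maximizes the plug-in loglikelihood contribution of that BMF value, and the BIC is then computed in the common form $-2\,l + d \log n$. The exact objective of Theorem 4 is not reproduced here; the profile-likelihood form, the base-$N$ encoding of the BMF, the function names and the synthetic test series are assumptions, and the parameter count `d` is supplied by the caller.

```python
import math
import random
from collections import Counter


def profile_loglik(x, n_states, s, L, k, y):
    """Plug-in loglikelihood contribution of BMF value k for the candidate order y."""
    joint, margin = Counter(), Counter()
    for t in range(s, len(x)):
        idx = 0
        for state in x[t - L:t]:            # assumed base-N encoding of the BMF
            idx = idx * n_states + state
        if idx == k:
            joint[x[t - y], x[t]] += 1
            margin[x[t - y]] += 1
    return sum(c * math.log(c / margin[far]) for (far, _), c in joint.items())


def estimate_conditional_orders(x, n_states, s, L):
    """Return the list of estimated orders and the maximized loglikelihood."""
    orders, total = [], 0.0
    for k in range(n_states ** L):
        best = max(range(L + 1, s + 1),
                   key=lambda y: profile_loglik(x, n_states, s, L, k, y))
        orders.append(best)
        total += profile_loglik(x, n_states, s, L, k, best)
    return orders, total


def bic(loglik, d, n):
    """BIC in the form -2 * loglikelihood + d * log(n)."""
    return -2.0 * loglik + d * math.log(n)


# Synthetic binary series in which x_t copies x_{t-3} with probability 0.9.
rng = random.Random(1)
x = [rng.randrange(2) for _ in range(4)]
while len(x) < 3000:
    prev = x[-3]
    x.append(prev if rng.random() < 0.9 else 1 - prev)

orders, loglik = estimate_conditional_orders(x, 2, 4, 1)
print(orders, bic(loglik, 10, len(x)))
```

To select $(s, L)$, the same computation would be repeated for every admissible pair and the pair with the smallest BIC retained.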

Asymptotic properties of statistical estimators
Let us assume that the Markov chain (1) satisfies the stationarity condition. Define the probability distribution of the $l$-tuple $X_t^{t+l-1} \in A^l$, $l \in \mathbb{N}$: $\pi_l(J_1^l) = \mathrm{P}\{X_t^{t+l-1} = J_1^l\}$, $J_1^l \in A^l$. First, let us present results on the consistency of the statistical estimators constructed in the previous section.
Theorem 5. If the Markov chain of conditional order (1) is stationary, then the statistical estimators (9) are consistent as $n \to \infty$:
Proof. It is known from Basawa and Prakasa Rao (1980) that the frequencies of the states for the first-order vector Markov chain $X_t$ (considered in the proof of Theorem 1) tend to the stationary probability distribution as $n \to \infty$: Thus we can prove that the relative frequencies of the $(s+1)$-tuples converge to $\pi_{s+1}(J_1^{s+1})$. Then we consider $\nu^s_{L+2,g(s_k,L)}(J_0^{L+1})$ and $\nu^s_{L+1,g(s_k,L)}(J_0^{L})$ as sums of the frequencies of $(s+1)$-tuples $\nu_{s+1}(J_1^{s+1})$: where $A_{s+1}(y, J_0^l) = \{I_1^{s+1} \in A^{s+1} : i_1 = j_0,\ I_{y+2}^{y+l} = J_2^l\}$, $y = 0, 1, \ldots$. So the following convergence holds: Using this equation and the theorem on functional transformations of convergent random sequences from Borovkov (1998a), we come to (13). The theorem is proved.
Theorem 6. Under the conditions of Theorem 5 the statistical estimators (11) are consistent as $n \to \infty$: $\hat{s}_k \to s_k$, $k = 0, \ldots, K$. (14)
Proof. Introduce the notation: is the Shannon information on the random symbol $x_{L+1}$ contained in the random symbol $x_0$ under the fixed BMF is the Shannon information on the random symbol $\hat{I}_k(y)$ is the plug-in statistical estimator for $I_k(y)$. First, note that where $\langle J_1^L \rangle = k$. The second statement we need to prove the theorem is the following: Using (16) and properties of Shannon information we can show that $I_k(s_k) \ge I_k(y)$, $\forall y \ne s_k$. Thus, applying the first continuity theorem from Borovkov (1998a) and the equation (15), we come to (14). The theorem is proved.
Theorem 7. Under the conditions of Theorem 5 the statistical estimators (12) are consistent as $n \to \infty$: $(\hat{s}, \hat{L}) \to (s, L)$, where Using asymptotic properties of the estimators (10) and (11), it is easy to show that for $n \to \infty$ the following asymptotics holds: where $L' + 1 \le y_k \le s'$. Using properties of entropy and methods described in Csiszar and Shields (1999), we can prove that $\mathrm{P}\{(\hat{s}, \hat{L}) = (s, L)\} \to 1$.
Now let us analyze the asymptotic normality property for the estimators (10). Theorem 8 establishes the asymptotic probability distribution of the normalized deviations of the statistical estimators for transition probabilities:
Theorem 8. Under the conditions of Theorem 5, as $n \to \infty$ the normalized deviations $\{q(J_0^{L+1}) : J_0^{L+1} \in A^{L+2}\}$ have a joint asymptotically normal probability distribution with zero mean and covariance matrix
Proof. Let us give only a scheme of the proof. The complete proof can be found in Kharin and Maltsew (2012). The theorem is proved using the asymptotic normality property of the frequencies $\nu_{s+1}(J_1^{s+1})$ and the representation of the estimators (10) as a function of these frequencies. Therefore, using the third continuity theorem from Borovkov (1998a), we can establish the asymptotic normality property for the estimators (10) and come to (17). The theorem is proved.

Statistical testing of hypotheses on the values of {Q (k) }
Using the results of Section 4, let us construct a statistical test for two hypotheses: where γ(
Corollary 3. Under the conditions of Theorem 9 the power of the test (19) as $n \to \infty$ tends to the limit: where $G_{u,\lambda}$ is the distribution function of the noncentral $\chi^2$-distribution with $u$ degrees of freedom and noncentrality parameter $\lambda$, and $\alpha \in (0, 1)$ is the given significance level.
Let us note that the power does not tend to 1, because the alternative hypothesis $H_{1n}$ tends to the null hypothesis as $n \to \infty$.

Computer experiments on hypothesis testing
Simulated data. First, we evaluate the performance of the test (19) for contiguous alternatives (20) in two series of computer experiments by the following scheme: $U = 1000$ realizations of the Markov chain of conditional order were simulated according to (1). Parameters of the model: $N = 2$, $A = \{0, 1\}$, $s = 8$, $L = 2$, $M = 4$, $s_0 = 8$, $s_1 = 6$, $s_2 = 8$, $s_3 = 3$. The length of the time series is $n \in \{1000, 1500, \ldots, 20000\}$. In the first series of experiments the transition probabilities were chosen randomly for the null hypothesis $H_0$. In the second series of experiments the transition probabilities were chosen randomly to satisfy the alternative hypothesis $H_1$. In both series the frequency $\nu_\rho$ of the decision "accept the hypothesis $H_1$" was calculated at the fixed value of $n$: where $\rho_u(n)$ is the value of $\rho(n)$ calculated from the $u$-th realization. In the first series $\nu_\rho$ is the estimator of the type I error probability; we denote it $\hat{\alpha}$. In the second series $\nu_\rho$ is the estimator of the power; we denote it $\hat{w}$. Results for the first series of experiments are presented in Figure 1; results for the second series of experiments are presented in Figure 2.
in bioinformatics, and fitting a stochastic model for a genetic sequence is a fruitful approach to this problem, described in Burge and Karlin (1997).
The sequence of introns from the human gene HSHMG17G taken from "Bioinformatics and genomics" (http://genome.crg.es/) was analyzed. The length of the sequence is $n = 6922$, $S_+ \le 6$, and the size of the state space $A$ is 4 (0 corresponds to nucleotide A, 1 to C, 2 to G, 3 to T). In the computer experiments we used the following three Markov chain models: the fully-connected $s$-order Markov chain (MC($s$)), the Markov chain of order $s$ with $r$ partial connections (MC($s$, $r$)) and the Markov chain of conditional order with BMF length $L$ (MCCO($s$, $L$)). For each model the value of BIC was calculated. Results are presented in Table 1. The minimum value of BIC is marked in bold. As we can see from Table 1, the most adequate model is the Markov chain of conditional order with parameters $s = 6$, $L = 1$. The estimates of the conditional orders are $\hat{s}_0 = 4$, $\hat{s}_1 = 3$, $\hat{s}_2 = 3$, $\hat{s}_3 = 6$. Estimates for transition matrices for this MCCO(6, 1)
Let us note that values of BIC close to the minimum are obtained for MCCO(4, 1) and MCCO(5, 1). These two models describe a dependence similar to that of MCCO(6, 1), but they have a shorter memory depth. Thus MCCO(6, 1) is chosen as the most adequate model, because the number of parameters for all three models is the same.

Conclusion
In this paper we consider a new parsimonious model for discrete-valued time series called the Markov chain of conditional order. Probabilistic and statistical properties of the model are established. Ergodicity conditions and conditions under which the stationary probability distribution is uniform are found. Statistical estimators for the parameters are constructed and their consistency is proved. The asymptotic probability distribution of the estimators for the one-step transition probabilities is found. A statistical test for the values of the transition matrices is constructed and its asymptotic power for contiguous alternatives is evaluated. Computer experiments on simulated time series and on real DNA sequences are conducted.
Figure 2. On both figures the horizontal axis corresponds to the time series length $n$, and the vertical axis corresponds to the value of $\nu_\rho$; in both cases $\alpha = 0.05$. The solid line in Figure 1 plots the significance level $\alpha$. The solid line in Figure 2 plots the theoretical power (21) of the test. As we can see, the theoretical values of $\alpha$ and $w$ are close to their experimental values $\hat{\alpha}$ and $\hat{w}$ respectively, which are indicated by dark circles.

Table 1: Values of BIC.