Statistical Estimation and Classification Algorithms for Regime-Switching VAR Model with Exogenous Variables

We consider a vector autoregression model with exogenous variables and Markov-switching regimes to describe complex systems with cyclic changes of states. To estimate and forecast the states, we propose an EM algorithm for non-classified data samples and a discriminant analysis algorithm for classified data samples. The accuracy of the algorithms is examined by means of computer simulation experiments.


Introduction
Regime-switching models are a convenient tool for the analysis of complex systems with cyclic changes of states (Hamilton 2008). Most studies are devoted to the Markov-switching vector autoregression (MS-VAR) model (Krolzig 1997). If the regimes are independent, or there is high uncertainty regarding the classes of states, models with independently switching regimes may be preferable. Autoregression and regression models of this type were studied in detail in Malugin and Kharin (1986) and Malugin (2014). The object of this study is a vector autoregression model with Markov-switching states and exogenous variables (MS-VARX), which includes the multivariate linear regression model as a special case (Malugin 2014).

Models and tasks of research
Let a complex system at time t be characterized by a random observation vector defined on the probability space (Ω, F, P), where Ω is a space of elementary events ω ∈ Ω and P is a probability measure: P(A) = P{ω ∈ A}, A ∈ F. Let {Ω_0, ..., Ω_{L−1}} be a decomposition of Ω into a finite number of non-empty disjoint subsets such that Ω_l ∈ F, P{Ω_l} = P{ω ∈ Ω_l} > 0, ∪_{l∈S(L)} Ω_l = Ω, S(L) = {0, ..., L − 1}. These subsets are the classes of states of the complex system, and L is the number of classes.
A random vector y_t = (x_t′, z_t′)′ ∈ ℝ^n can be partitioned into subvectors of endogenous variables x_t = (x_tj) ∈ ℝ^N and deterministic exogenous variables (regressors) z_t = (z_tk) ∈ Z ⊂ ℝ^M. It is assumed that, in general, the time series is described by an RS-VARX(p) model (p ≥ 1):

x_t = Σ_{i=1}^{p} A_{d(t),i} x_{t−i} + B_{d(t)} z_t + η_{d(t),t},  t = 1, ..., T,  (1)

where x_{1−p}, ..., x_0 ∈ ℝ^N is a set of given initial values; η_{d(t),t} ∈ ℝ^N is a random disturbance (innovation) process; and d(t) ∈ S(L) = {0, ..., L − 1} is the state of the system at time t.
M.3. Structural heterogeneity conditions for the matrices of autoregression and regression coefficients: we consider a model with L (2 ≤ L < s + 1) classes of states, where s ≥ 1 is the number of state-switching points 1 < τ_1 < ... < τ_s < T. Concerning the sequence of states d(t) ≡ d_t ∈ S(L) (t = 1, ..., T) there are two types of assumptions:
d.1. d_t (t = 1, ..., T) are independent identically distributed random variables with a given probability distribution;
d.2. d_t (t = 1, ..., T) is a homogeneous ergodic Markov chain with distribution determined by the vector π of initial-state probabilities and the matrix P of one-step transition probabilities.
Under conditions d.1 and d.2 we deal with the IS-VARX and MS-VARX models respectively. Model (1) includes a number of special cases: the multivariate linear regression model RS-MLR if p = 0, M ≥ 1 (Malugin 2014); the RS-VAR model without exogenous variables if p > 0, M = 0 (Krolzig 1997).
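As an illustration, data from model (1) under the Markov-switching assumption d.2 can be simulated as follows. This is a minimal sketch: all dimensions and parameter values (N, M, p, L, the matrices A_l, B_l, Σ_l, the vector π and the matrix P) are hypothetical and chosen only for the example, and a single lag (p = 1) is used for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameters (illustration only).
N, M, p, L, T = 2, 1, 1, 2, 300
A = [0.3 * np.eye(N), -0.2 * np.eye(N)]          # A_l, one per regime l
B = [np.ones((N, M)), 2.0 * np.ones((N, M))]     # B_l
Sigma = [0.5 * np.eye(N)] * L                    # common Sigma for both classes
P = np.array([[0.95, 0.05], [0.10, 0.90]])       # one-step transition matrix
pi0 = np.array([0.5, 0.5])                       # initial-state distribution pi

def simulate_ms_varx(T, rng):
    """Draw states d_t from the Markov chain, then x_t from model (1), p = 1."""
    d = np.empty(T, dtype=int)
    x = np.zeros((T + p, N))                     # first p rows: initial values
    z = rng.normal(size=(T, M))                  # exogenous regressors z_t
    d[0] = rng.choice(L, p=pi0)
    for t in range(T):
        if t > 0:
            d[t] = rng.choice(L, p=P[d[t - 1]])  # homogeneous Markov chain
        l = d[t]
        x[t + p] = (A[l] @ x[t + p - 1] + B[l] @ z[t]
                    + rng.multivariate_normal(np.zeros(N), Sigma[l]))
    return x[p:], z, d

X, Z, d_true = simulate_ms_varx(T, rng)
```

Such simulated samples, together with the known state vector d_true, are exactly the setting of the computer experiments discussed later.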
The true values of the model parameters {A_l, B_l, Σ_l (l ∈ S(L))}, π, P and the moments of state switching {τ_i} (i = 1, ..., s) are unknown. There is either a classified or a non-classified sample of observations (X, Z) (X = (x_t) ∈ ℝ^{NT}, Z = (z_t) ∈ Z^T), so that the vector of states d = (d_t) ∈ S^T(L) is either known or unknown. We present two statistical classification algorithms for the MS-VARX model in these cases: an EM algorithm for joint estimation of the parameters and the vector of states from a non-classified sample, and a discriminant analysis algorithm for classification of out-of-sample observations in the case of a classified sample. For the IS-MLR and IS-VARX models the listed tasks are solved in Malugin (2014).

Splitting of mixtures described by MS-VARX
Representations for the model parameters. Model (1) under the assumptions M.1–M.3, d.1 and d.2 can be represented in the regression form

x_t = Π_{d(t)} u_t + η_{d(t),t},  t = 1, ..., T,  (3)

where Π_{d(t)} = (A_{d(t),1}, ..., A_{d(t),p}, B_{d(t)}) is the block N × (pN + M) matrix of parameters, and u_t = (x′_{t−1}, ..., x′_{t−p}, z′_t)′ ∈ ℝ^{pN+M} is the stacked vector of predetermined variables formed from lagged endogenous and exogenous variables with values known at time t.
If model (3) satisfies the assumptions M.1–M.3, then the random vector x_t ∈ ℝ^N, given the values u_t ∈ ℝ^{pN+M} and d_t = l (l ∈ S(L)), has a conditional normal distribution with density

p_X(x; u, θ_l) = n_N(x; Π_l u, Σ_l).  (4)

The likelihood function L(φ; X, Ū, D) for the parameters φ under a fixed state vector D ∈ S^T(L) and assumptions (4) and d.2 is of the form (5). Let Λ(φ, φ̃) be the conditional expectation of the log-likelihood function l(φ; X, Ū, D) = ln L(φ; X, Ū, D) induced by the distribution P{D | X, Z; φ̃} of the random vector D given the fixed sample (X, Ū) and an initial value φ̃ of the parameter vector.
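The conditional density (4) is Gaussian with mean Π_l u and covariance Σ_l. A minimal numerical sketch of its logarithm (the function name is illustrative):

```python
import numpy as np

def log_gauss_density(x, u, Pi_l, Sigma_l):
    """log n_N(x; Pi_l u, Sigma_l): conditional log-density of x_t
    given u_t and d_t = l, as in formula (4)."""
    N = x.shape[0]
    r = x - Pi_l @ u                         # residual x_t - Pi_l u_t
    _, logdet = np.linalg.slogdet(Sigma_l)   # log|Sigma_l|, numerically stable
    quad = r @ np.linalg.solve(Sigma_l, r)   # r' Sigma_l^{-1} r
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)
```

For N = 1, Π_l = 0 and Σ_l = 1 this reduces to the standard normal log-density.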
In accordance with the general approach (Malugin 2014; Bilmes 1998) we obtain analytical representations for the unknown characteristics. In the considered case we have a conditional normal distribution for the vector of endogenous variables with density p_X(x; u, θ_l) for the given vector of predetermined (lagged and exogenous) variables u_t = (x′_{t−1}, ..., x′_{t−p}, z′_t)′ ∈ ℝ^{pN+M}. Formulas for the posterior probabilities {γ̃_{l,t}}, {ξ̃_{kl,t}} are based on the density p_X(x; u, θ_l) and follow from Lemma 1. The proof of Lemma 1 is based on the method from Bilmes (1998) for Gaussian mixtures with Markov regime switching.
Proof. The three terms Q_1, Q_2 and Q_3 in formula (6) depend on different parameter sets. Therefore, the optimization problem for Λ(φ, φ̃) can be partitioned into three independent optimization problems for functions continuous in the parameters, where the posterior probabilities {γ̃_{l,t}}, {ξ̃_{kl,t}} are given. To maximize the functions Q_1, Q_2 under equality constraints we use the method of Lagrange multipliers. Maximization of the function Q_3 is carried out separately in the matrices Π_l and Σ_l by calculating the derivatives and using properties of matrix operations (Anderson 1984).
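Maximizing Q_3 separately in Π_l and Σ_l yields weighted least-squares updates. The following sketch assumes posteriors gamma[t, l] from the E-step, with X and U holding the rows x_t′ and u_t′; variable names are illustrative:

```python
import numpy as np

def m_step(X, U, gamma):
    """Closed-form updates for Pi_l and Sigma_l obtained by maximizing Q_3
    separately in each matrix, given posterior weights gamma[t, l]."""
    T, L = gamma.shape
    Pi, Sigma = [], []
    for l in range(L):
        w = gamma[:, l][:, None]              # posterior weights for class l
        Sxu = (X * w).T @ U                   # sum_t gamma_{l,t} x_t u_t'
        Suu = (U * w).T @ U                   # sum_t gamma_{l,t} u_t u_t'
        Pi_l = Sxu @ np.linalg.inv(Suu)       # weighted least squares
        R = X - U @ Pi_l.T                    # residuals x_t - Pi_l u_t
        Sigma_l = (R * w).T @ R / w.sum()     # weighted residual covariance
        Pi.append(Pi_l)
        Sigma.append(Sigma_l)
    return Pi, Sigma
```

With a single class and noiseless data X = U Π′ the update recovers Π exactly, which is a convenient sanity check.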
EM algorithm for MS-VARX. For joint estimation of all parameters φ ∈ ℝ^q and the state vector D ∈ S^T(L), the EM MS-VARX algorithm (Expectation-Maximization algorithm for MS-VARX) is addressed. The EM MS-VARX algorithm belongs to the family of Baum-Welch algorithms for splitting a mixture of multivariate distributions controlled by a hidden Markov chain (Bilmes 1998).
The algorithm includes the following steps (superscript k in brackets indicates the iteration number).
Convergence problems for algorithms of this type are investigated in numerous studies, in particular in Krolzig (1997) and Malugin (2014). The convergence of the algorithm ensures the consistency of the resulting parameter estimates φ̂, π̂, P̂ as well as the consistency of the classification rule (13).

Discriminant analysis of the MS-VARX
The decision rule for classification of multivariate autoregression observations (X, Ū) described by the MS-VARX model can in general be defined as D̂ = (d̂_t) = D̂(X, Ū), d̂_t = d̂_t(X, Ū) ∈ S(L), t = 1, ..., T. The accuracy of this rule is characterized by the probability of misclassification (15), where D⁰ = (d⁰_t) and D̂ = (d̂_t) are the true state vector and its estimate respectively. Assume first that all parameters of the MS-VARX model (3) are known. We describe an optimal classification rule, called the Bayesian decision rule (BDR) (Malugin 2014; Kharin 1996), which minimizes the probability of misclassification (15). Bayesian decision rules for pointwise and groupwise classification of multivariate observations described by the IS-VARX and IS-MLR models have been proposed and studied in Malugin (2014). In the considered case of the MS-VARX model we address the groupwise classification decision rule. A similar problem for a parametric family of continuous probability distributions was considered in Kharin (1996). To formulate the decision rule we use the log-likelihood function, which for a fixed vector D, according to (5), simplifies to (16).

Lemma 2. If the MS-VARX model (3) satisfies the assumptions M.1–M.3, d.2 and the stacked vector of parameters φ ∈ ℝ^q is known, then the BDR of groupwise classification is determined by condition (17), where (X̄_1^T, Ū_1^T) is the sample of observations to be classified.
Proof. It is known (Kharin 1996) that a decision rule of the form (17) minimizes the probability of misclassification for an arbitrary parametric family of continuous distributions; such decision rules are known as Bayesian decision rules. Under the conditions of Lemma 2, the vector of endogenous variables x_t ∈ ℝ^N corresponding to fixed values u_t ∈ ℝ^{pN+M} and d_t = l (l ∈ S(L)) has a conditional Gaussian distribution with density (4), which belongs to the above-mentioned family of parametric continuous distributions.
To solve the integer optimization problem (17) for a fixed continuous vector φ ∈ ℝ^q (q = Lm + (L − 1)(L + 1)) we use the dynamic programming method (Kharin 1996; Bellman and Dreyfus 1962). Its implementation requires a special representation of the log-likelihood function l(φ; X, Ū, D) through the so-called Bellman functions.
Theorem 2. Under the conditions of Lemma 2, the BDR of groupwise classification of the sample (X̄_1^T, Ū_1^T) is implemented by the dynamic programming method in accordance with relationships (18)-(20), where {F_t(k)} are the Bellman functions, {f_t(k, l)} are described by formulas (20), δ_{t1} is the Kronecker symbol, and t = 1, ..., T − 1.
Proof. Under the conditions of Lemma 2, formulas (18)-(20) are obtained by means of equivalent transformations of the function l(φ; X, Ū, D), on the basis of (16), (17) and (20). The dynamic programming procedure includes the following two stages, which use formulas (19) and (18) respectively: 1) recursive calculation of the Bellman functions {F_t(l)} (l ∈ S(L), t = 1, ..., T − 1); 2) calculation of the components of the vector D̂ in reverse order. Since the parameters {θ_l} (l ∈ S(L)), π, P are unknown, we need to use their estimates obtained from a sample of classified observations. To obtain such a sample and to find the estimates {θ̂_l} (l ∈ S(L)), π̂, P̂ it is suggested to apply the EM MS-VARX algorithm proposed above. Thus, the following statement is true.
Corollary. If {θ̂_l} (l ∈ S(L)), π̂, P̂ are consistent estimates of the parameters of model (3), then using them in (15)-(17) instead of the unknown parameter values yields a consistent "plug-in" Bayesian decision rule.
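The two-stage dynamic programming procedure (forward recursion over Bellman functions, then backtracking in reverse order) can be sketched as a Viterbi-type recursion. The sketch assumes the per-observation log-densities loglik[t, l] = ln p_X(x_t; u_t, θ_l) have been precomputed; function and variable names are illustrative:

```python
import numpy as np

def bdr_groupwise(loglik, log_pi, log_P):
    """Groupwise BDR via dynamic programming: forward recursion over the
    Bellman functions F_t(l), then backtracking d_T, ..., d_1 in reverse."""
    log_pi, log_P = np.asarray(log_pi), np.asarray(log_P)
    T, L = loglik.shape
    F = np.empty((T, L))                      # Bellman functions F_t(l)
    back = np.empty((T, L), dtype=int)        # argmax of the previous state
    F[0] = log_pi + loglik[0]
    for t in range(1, T):                     # stage 1: forward recursion
        cand = F[t - 1][:, None] + log_P      # F_{t-1}(k) + ln p_{kl}
        back[t] = cand.argmax(axis=0)
        F[t] = cand.max(axis=0) + loglik[t]
    d = np.empty(T, dtype=int)                # stage 2: reverse-order backtrack
    d[-1] = F[-1].argmax()
    for t in range(T - 2, -1, -1):
        d[t] = back[t + 1, d[t + 1]]
    return d
```

A "plug-in" version is obtained by replacing the true parameters inside loglik, log_pi and log_P with the EM MS-VARX estimates, as in the Corollary.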

Performance evaluations
Parameter values for various experiments.
Characteristics of classification and estimation accuracy. The matrix H = B_1 − B_0 in the case A_1 = A_0, Σ_0 = Σ_1 = Σ determines the degree of distinctiveness of the classes caused by structural changes in the matrix of regression coefficients. The probability of misclassification under the model assumptions is calculated according to the formulas from Malugin (2014) and Kharin (1996), where Φ(·) is the standard normal distribution function and ∆(z) is the interclass Mahalanobis distance at point z. The empirical probability of misclassification is estimated by averaging the classification results of K = 100 random samples for each set of parameters: r̂ = K^{−1} Σ_{i=1}^{K} T^{−1} Σ_{t=1}^{T} 1{d⁰_{i,t} ≠ d̂_{i,t}}, where D⁰_i = (d⁰_{i,t}) and D̂_i = (d̂_{i,t}) are the true state vector and its estimate respectively for the i-th sample. The accuracy of the parameter estimates is measured by δ_θ = ||θ̂ − θ||, δ_P = ||P̂ − P||, where || · || is the Euclidean norm of a matrix or vector.
Analysis of the results of experiments. Case 1: the impact of differences in the matrices of regression and autoregression coefficients for different classes. Parameter values (set 1): variants B.1-B.3 for the matrix of regression coefficients, A_1 = A_2 = O_{N×N}, ω = 0.2. The estimates of the accuracy measures for these experiments are presented in Table 1.
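The Monte Carlo estimate of the misclassification probability, obtained by averaging classification results over K random samples, can be sketched as follows (function name illustrative):

```python
import numpy as np

def empirical_error_rate(d_true_samples, d_hat_samples):
    """Average per-observation misclassification frequency over K samples:
    r_hat = K^{-1} sum_i T^{-1} sum_t 1{d0_{i,t} != dhat_{i,t}}."""
    K = len(d_true_samples)
    return sum(np.mean(np.asarray(d0) != np.asarray(dh))
               for d0, dh in zip(d_true_samples, d_hat_samples)) / K
```

Here d_true_samples and d_hat_samples are lists of K true and estimated state vectors, one pair per simulated sample.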

Figure 2: The effect of class uncertainty on the EM MS-VARX algorithm (columns from left to right): interclass distance ∆(z); estimate of the probability of misclassification r̂_EM; characteristics of parameter estimation accuracy δ_θ.

Description of test models and examples. We consider the MS-VARX model in the form (1) or (3) under the assumptions M.1-M.3, d.2 with cyclic changes in the matrix of regression coefficients. The aim of the experiments is to evaluate the accuracy of classification and prediction for the proposed decision rules. We use the following notation for the proposed classification algorithms for MS-VARX: BDR, the Bayesian decision rule of groupwise classification; EBDR, the estimated ("plug-in") BDR; EM, the EM MS-VARX algorithm. General description of the test models:

Table 1: The impact of structural changes in regression coefficients.