“Plug-in” Statistical Forecasting of Vector Autoregressive Time Series with Missing Values

The problems of statistical forecasting of vector autoregressive time series with missing values are considered. The maximum likelihood forecast is constructed and its mean square risk is evaluated for the case of known parameters. The “plug-in” forecast and statistical estimators are constructed for unknown parameters. Asymptotic properties of constructed estimators are analyzed. Results of numerical experiments are presented.


Introduction
Missing values are a typical distortion of model assumptions in data analysis (Little and Rubin, 1987; Stockinger and Dutter, 1987). It is fairly common for a time series to have gaps for a variety of reasons (Greene, 2000; Schafer, 1997): 1) the data do not exist at the frequency at which we wish to observe them; 2) registration errors; 3) deletion of "outliers".
Vector autoregression (VAR) is often used in practice for statistical analysis (statistical estimation of parameters, statistical testing of hypotheses, statistical forecasting) of time series in econometrics (Pantula and Shin, 1993), biometrics (Beran et al., 1998), technometrics (Jones, 1980) and in many other applications.
If the parameters of the VAR model are a priori known, then the ML-forecast (Kharin and Huryn, 2003) can be used. In practice, the parameters are usually unknown, and there are three approaches to forecasting under this prior uncertainty: 1) joint maximum likelihood estimation of the future value of the considered time series and of the parameters; 2) application of the EM-algorithm (Little and Rubin, 1987); 3) the "plug-in" approach, consisting of two stages: a) estimation of the parameters by some admissible method; b) calculation of the forecast by putting the estimates of the parameters into the ML-forecast.
The first approach is characterized by significant computational complexity; the second approach, and also the first, suffers from the multimodality of the objective function (because of the difficult problem of detecting the main maximum among many local ones). To avoid these difficulties we develop here the third approach.
In Section 2 we define the underlying model of the time series and formulate the main assumptions on the probabilistic characteristics of the model and on the "missing patterns". Section 3 is devoted to ML-forecasting under missing values for two levels of prior uncertainty: 1) the parameters of the VAR model are known; 2) the parameters are unknown. In Section 4 we construct statistical estimators of the parameters B, G, the "plug-in" forecasting procedure, and also analyze asymptotic properties of the constructed estimators. Section 5 contains some numerical results using real statistical data.

Mathematical Model
Let the observed d-vector time series Y_t be described by the VAR(1) model (Anderson, 1971):

Y_t = B Y_{t-1} + U_t, t ∈ Z,  (1)

where Z is the set of integers; B ∈ R^{d×d} is a matrix of unknown coefficients (all eigenvalues of the matrix B are assumed to be inside the unit circle); U_t = (U_{t1}, ..., U_{td})' ∈ R^d; {U_t} are i.i.d. random vectors, E{U_t} = 0_d is the zero d-vector, E{U_t U_t'} = Σ = Const_t, |Σ| ≠ 0, where by Const_{i_1, ..., i_n} we denote a mathematical object (variable, vector, matrix, etc.) that is independent of the variables i_1, ..., i_n, n ∈ N.

There are missing values in the observations {Y_t}. For each vector Y_t the binary vector (called the "missing pattern") O_t = (O_{t1}, ..., O_{td}) ∈ {0, 1}^d is given, where O_{ti} = 1 if Y_{ti} is observed, and O_{ti} = 0 if Y_{ti} is a missing value. Define the discrete set M = {(t, i) : t ∈ Z, i ∈ {1, ..., d}, O_{ti} = 1}; its elements are assumed to be ordered lexicographically in ascending order; K = |M| is the total number of observed components; t_− = min{t : Σ_{i=1}^d O_{ti} > 0} is the minimal time moment with observed components, and t_+ = max{t : Σ_{i=1}^d O_{ti} > 0} is the maximal one. Without loss of generality assume t_− = 1, t_+ = T. Note that the AR(p) and VAR(p) models can be transformed to a VAR(1) model by increasing the number of components.
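As an illustration of the model (1) and the "missing patterns" {O_t}, here is a minimal simulation sketch; the coefficient matrix B, the innovation covariance Σ, and the 80% per-component observation probability are hypothetical choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 200

# Hypothetical stable coefficient matrix B (all eigenvalues inside the unit circle)
B = np.array([[0.5, 0.1, 0.0],
              [0.0, 0.4, 0.2],
              [0.1, 0.0, 0.3]])
Sigma = np.eye(d)  # innovation covariance, |Sigma| != 0

# VAR(1): Y_t = B Y_{t-1} + U_t, {U_t} i.i.d. N(0_d, Sigma)
Y = np.zeros((T + 1, d))
for t in range(1, T + 1):
    Y[t] = B @ Y[t - 1] + rng.multivariate_normal(np.zeros(d), Sigma)
Y = Y[1:]

# "Missing patterns" O_t: each component observed independently with prob. 0.8
O = (rng.random((T, d)) < 0.8).astype(int)
K = int(O.sum())       # K = |M|, total number of observed components
X = Y[O.astype(bool)]  # observed components in lexicographic (t, i) order
```

The boolean mask extracts the observed components in row-major order, which matches the lexicographic ordering of M used throughout the paper.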
Let us introduce some assumptions on the innovation process U_t and the "missing patterns" {O_t}.

A1. The moments of orders three and four of the innovation process are bounded.

A2. The fourth-order moment of the innovation process does not depend on the time moment, and the moments of orders 5 to 8 of the innovation process are bounded.

A4. The "missing patterns" {O_t} satisfy the asymptotics at T → ∞ (i, j ∈ {1, ..., d}):

(1/T) Σ_{t=1}^{T} O_{ti} O_{tj} → ν^{(0)}_{ij}, (1/T) Σ_{t=2}^{T} O_{ti} O_{t−1,j} → ν^{(1)}_{ij},  (2)

where ν^{(0)}_{ij} is the limit frequency of the pair of components (i, j) being observed at the same time moment, and ν^{(1)}_{ij} is the limit frequency of the pair of components (i, j) being observed at neighboring time moments.

A5. The "missing patterns" {O_t} satisfy analogous asymptotics at T → ∞ defined by the limit frequencies ν^{(0)}_{i,j,i',j'}(τ), the limit frequency of the pair of components (i, j) being observed together with the pair (i', j') at delay τ; ν^{(1)}_{i,j,i',j'}(τ), the limit frequency of the pair (i, j) being observed together with the components (i', j') observed at delays τ − 1 and τ; and ν^{(2)}_{i,j,i',j'}(τ), the limit frequency of the pair (i, j) being observed at neighboring time moments together with the pair (i', j') being observed at neighboring time moments at delay τ.

Let Y_{T+τ} ∈ R^d be a "future vector" to be forecasted for τ ≥ 1, and let Ŷ_{T+τ} = Ŷ_{T+τ}(X) : R^K → R^d be a forecasting statistic (procedure). Introduce the matrix risk R ∈ R^{d×d} and the (scalar) risk r of forecasting:

R = E{(Ŷ_{T+τ} − Y_{T+τ})(Ŷ_{T+τ} − Y_{T+τ})'}, r = tr(R).

It is known (Greene, 2000) that for the case of complete observations and known parameters B, Σ the minimal risk is r*_0 = tr(Σ_{k=0}^{τ−1} B^k Σ (B')^k) (Greene, 2000).
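Assumption A4 can be made concrete numerically: the limit frequencies ν^{(0)}_{ij}, ν^{(1)}_{ij} are approximated by the sample frequencies of jointly observed pairs of components. A sketch, assuming hypothetical Bernoulli missing patterns with 70% observation probability:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 3, 500
O = (rng.random((T, d)) < 0.7).astype(int)  # hypothetical missing patterns

# Sample analogue of nu^(0)_{ij}: frequency of the components (i, j)
# being observed at the same time moment
nu0 = (O[:, :, None] * O[:, None, :]).mean(axis=0)

# Sample analogue of nu^(1)_{ij}: frequency of component i observed at
# time t together with component j observed at time t - 1
nu1 = (O[1:, :, None] * O[:-1, None, :]).mean(axis=0)
```

By construction nu0 is symmetric, while nu1 in general is not, since it pairs components across neighboring time moments.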

ML-Forecasting Under Missing Values
Define a bijection M ↔ {1, ..., K}: k = χ(t, i), together with its inverse function (t, i) = χ^{−1}(k).

Lemma 1. Let the model (1) take place. Then the expressions (3) for F, H hold.

Proof. The expression for F follows from the expression for the covariance matrix of the VAR(1) model (Anderson, 1971). In the same way we find H.
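The bijection χ can be tabulated directly from the missing patterns. A small sketch with hypothetical patterns for T = 4, d = 2, using 1-based numbering of k as in the text:

```python
# Hypothetical missing patterns O_t for T = 4 time moments, d = 2 components
O = [(1, 0), (1, 1), (0, 0), (1, 1)]

# M: lexicographically ordered positions (t, i) of observed components
M = [(t, i) for t in range(len(O)) for i in range(2) if O[t][i] == 1]
K = len(M)  # total number of observed components

# Bijection chi: M -> {1, ..., K} and its inverse function
chi = {pos: k + 1 for k, pos in enumerate(M)}
chi_inv = {k: pos for pos, k in chi.items()}
```

Here K = 5: the fully missing time moment t = 3 (third pattern) contributes no elements to M.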
Theorem 1. Let the model (1) and the assumption A3 take place. If the true values B, Σ are known, and |F| ≠ 0, then the ML-forecasting statistic and its risk functionals are given by (4) and (5).

Proof. Under the assumptions of Theorem 1, the vector Y_+ has a Gaussian distribution. By the Anderson theorem (Anderson, 1971), the likelihood function is (6), where n_K(X | µ, Σ) denotes the K-dimensional Gaussian p.d.f. with parameters µ, Σ.
The ML-forecast is the solution of the extremum problem of maximizing the likelihood over Y_{T+τ}. Since the first multiplier in (6) does not depend on Y_{T+τ}, we come to the unique solution (4): Ŷ_{T+τ,ML} = H F^{−1} X = A_0 X. Using the total mathematical expectation formula and (4), we find the risk (5).

Theorem 2. Let the model (1) and the assumption A3 take place. If B, Σ are unknown and |F| ≠ 0, then the ML-forecast of Y_{T+τ} has the "plug-in" form (7), where the ML-estimators of the model parameters B, Σ are the solution of the corresponding minimization problem.

Proof. According to equation (6), the joint ML-estimators of Y_{T+τ}, B, Σ are the solution of the extremum problem: l(Y_{T+τ}; B, Σ) → max over Y_{T+τ}, B, Σ. From Theorem 1 we get (7), where the ML-estimators of B, Σ are the solution of the problem of maximizing the first multiplier of (6). Taking the logarithm, we come to the statement.
Let us define the minimal admissible observation time for the observed time series. It follows from the underlying model (1) that |G| ≠ 0 and the matrices B, G, G_1 satisfy the matrix equation G_1 = B G. Following the "plug-in" principle and using this equation, let us construct the matrix statistic (if |Ĝ| ≠ 0): B̂ = Ĝ_1 Ĝ^{−1}. Putting the statistics B̂, Ĝ (instead of B, G) into (3), we get the matrices F̂, Ĥ, the statistic Â_0 = Ĥ F̂^{−1} (if |F̂| ≠ 0), and the "plug-in" forecasting procedure: Ŷ_{T+τ} = Â_0 X.

According to Lemma 1 and Theorems 1 and 2, the performance of the "plug-in" forecasting procedure is determined by the performance of the underlying estimators B̂, Ĝ, Ĝ_1. Let us analyze the asymptotic properties (T → ∞) of the proposed estimators (8), (9). Denote the functions (t, t', τ ∈ Z, T ∈ N, i, j, i', j' ∈ {1, ..., d}) generated by the "missing patterns", and the corresponding covariances. The following lemma is straightforward, and its proof is omitted.
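A simplified numerical sketch of the estimators (8), (9): pairwise-available averages over jointly observed components yield Ĝ and Ĝ_1, and then B̂ = Ĝ_1 Ĝ^{−1}. The coefficient matrix, sample size, and observation probability below are hypothetical, and E{Y_t} = 0_d is used:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 2, 3000
B = np.array([[0.6, 0.2],
              [0.1, 0.5]])  # hypothetical stable coefficient matrix

# Simulate the VAR(1) series and Bernoulli missing patterns
Y = np.zeros((T + 1, d))
for t in range(1, T + 1):
    Y[t] = B @ Y[t - 1] + rng.standard_normal(d)
Y = Y[1:]
O = (rng.random((T, d)) < 0.9).astype(float)
Yo = Y * O  # missing components replaced by zeros

# Pairwise-available moment estimators (a sketch of (8)):
# G_hat[i, j]  ~ E{Y_ti Y_tj},     averaged over t with both components observed
# G1_hat[i, j] ~ E{Y_(t+1)i Y_tj}, averaged over neighboring observed pairs
G_hat = (Yo.T @ Yo) / (O.T @ O)
G1_hat = (Yo[1:].T @ Yo[:-1]) / (O[1:].T @ O[:-1])

# "Plug-in" estimator (9): B_hat = G1_hat G_hat^{-1}, valid when |G_hat| != 0
B_hat = G1_hat @ np.linalg.inv(G_hat)
```

Zeroing the missing components makes each matrix product sum only over jointly observed pairs, so dividing by the pair counts gives the pairwise-available averages.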
Lemma 2. Let the model (1) and the assumptions A4, A5 take place. Then at T → ∞ the "missing patterns" {O_t} exhibit the stated asymptotic behavior.

Lemma 3. Let the model (1) and the assumption A2 take place. Then the covariance depends functionally on the time moments t, t' through their difference t − t' only, where t, t', u, u' ∈ Z, λ ∈ [0, 1), and the stated relations connecting the covariances C_{τ,i,j,i',j'} and C_{−τ,i,j,i',j'} take place, τ ∈ Z, i, j, i', j' ∈ {1, ..., d}.
Proof. The first statement of the lemma follows from the invariance (w.r.t. time t) of the moments Σ, Σ^{(4)}. The second statement follows from the inequality for the powers of the matrix B (Anderson, 1971). By a replacement of indices one easily comes to the third statement of the lemma.
Theorem 3. Let the model (1) and the assumptions A1, A4 take place. Then the estimators (8), (9) are consistent at T → ∞.

Proof. The statement follows from the expression (9) for the matrix B̂ and the properties of the estimators (8) of the covariances.
The next results on the asymptotic normality of B̂ are based on the following central limit theorem for m_T-dependent random vectors (Shergin, 1976; Maejima, 1978).
Theorem 4 (Shergin, 1976; Maejima, 1978). Let a sequence of m_T-dependent random vectors have bounded moments and satisfy the stated assumptions; then the normalized sum has an asymptotically normal distribution.

If the covariances of a sequence of random vectors ξ_T converge at T → ∞ (i, j ∈ {1, ..., d}), let us call the limits the asymptotic covariances of the sequence ξ_T. In the same manner define the asymptotic variance and the asymptotic mathematical expectation of the sequence ξ_T.

Lemma 4. Let the model (1) and the assumptions A2, A4, A5 take place. Then at T → ∞ the vector composed of the elements of the matrices √T(Ĝ − G), √T(Ĝ_1 − G_1) has an asymptotically normal distribution with zero mean and the asymptotic covariances (10), expressed through g^{(1)}_{τ,i,j,i',j'} C^{(1)}_{τ,i,j,i',j'}, where i, j, i', j' ∈ {1, ..., d}.
Proof. According to the theorem of Shiryaev (1995), a vector has an asymptotically normal distribution if and only if every linear combination of its elements has an asymptotically normal distribution. Define a linear combination with arbitrary coefficients α_{ij}, β_{ij} ∈ R and decompose it into the terms η_1(T), η_2(T), η_3(T), where [•] denotes the integer part. Using Lemmas 2 and 3, one can easily verify the assumptions of Theorem 4 and thereby prove that η_1(T) has an asymptotically normal distribution with zero mean and the asymptotic variance expressed through α_{ij} α_{i'j'} g^{(1)}_{τ,i,j,i',j'} C^{(1)}_{τ,i,j,i',j'} and α_{ij} β_{i'j'} g^{(2)}_{τ,i,j,i',j'} C^{(2)}_{τ,i,j,i',j'}. The convergence in probability of the second and the third terms to zero takes place: η_2(T) →^P 0, η_3(T) →^P 0 at T → ∞.
Then, according to the theorem of Shiryaev (1995), the vector composed of the elements of the matrices √T(Ĝ − G), √T(Ĝ_1 − G_1) has an asymptotically normal distribution with zero mean and the asymptotic covariances (10).
Theorem 5. Let the model (1) and the assumptions A2, A4, A5 take place. Then at T → ∞ the vector composed of the elements of the matrix √T(B̂ − B) has an asymptotically normal distribution with zero mean and the asymptotic covariances (11), where i, j, i', j' ∈ {1, ..., d}.
Thus the vector composed of the elements of the matrix √T(B̂ − B) has an asymptotically normal distribution with zero mean and the asymptotic covariances (11).
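The √T-asymptotic normality of Theorem 5 can be observed in a Monte Carlo sketch for the scalar special case d = 1 with complete observations, where √T(b̂ − b) → N(0, 1 − b²); the coefficient b = 0.5, the sample size, and the number of replications are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(3)
T, n_rep, b = 400, 300, 0.5

stats = []
for _ in range(n_rep):
    # Simulate an AR(1) path y_t = b y_{t-1} + u_t
    y = np.zeros(T + 1)
    for t in range(1, T + 1):
        y[t] = b * y[t - 1] + rng.standard_normal()
    y = y[1:]
    # Scalar analogue of (9): b_hat = g1_hat / g_hat (complete observations)
    b_hat = (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])
    stats.append(np.sqrt(T) * (b_hat - b))
stats = np.asarray(stats)

# stats.mean() should be near 0, stats.std() near sqrt(1 - b**2)
```

With these settings the empirical standard deviation of the replicated statistics approaches the theoretical value √(1 − 0.25) ≈ 0.866.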

Numerical Results
To evaluate the performance of the estimators (9), an experiment on the celebrated "Beveridge wheat price index 1500-1869" data (Anderson, 1971) was made. The AR(3) model is

y_{t+1} = 0.7489 y_t − 0.3397 y_{t−1} + 0.0388 y_{t−2} + ξ_t,

where {ξ_t} are i.i.d. Gaussian random variables with zero mean and the variance σ² = 4, t_− = 1, t_+ = T = 100. This model was transformed to the 3-variate VAR(1) model (1) with Y_t = (y_t, y_{t−1}, y_{t−2})' ∈ R³ and was considered for the "missing patterns" O_t = 0, t = …

Figure 1: Sample variance of the estimator B̂
Figure 2
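The transformation of the AR(3) model above into the 3-variate VAR(1) model (1) uses the standard companion form; the following sketch mirrors the experiment's settings, but the code itself is illustrative:

```python
import numpy as np

# Companion (VAR(1)) form of the AR(3) model
# y_{t+1} = 0.7489 y_t - 0.3397 y_{t-1} + 0.0388 y_{t-2} + xi_t
a1, a2, a3 = 0.7489, -0.3397, 0.0388
B = np.array([[a1, a2, a3],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# Stability: all eigenvalues of B must lie inside the unit circle
rho = max(abs(np.linalg.eigvals(B)))

# Simulate Y_t = (y_t, y_{t-1}, y_{t-2})' with sigma^2 = 4, T = 100;
# only the first component of U_t is random: U_t = (xi_t, 0, 0)'
rng = np.random.default_rng(4)
T, sigma = 100, 2.0
Y = np.zeros((T + 1, 3))
for t in range(1, T + 1):
    Y[t] = B @ Y[t - 1] + np.array([sigma * rng.standard_normal(), 0.0, 0.0])
Y = Y[1:]
```

Note that in the companion form the innovation covariance is degenerate (only the first component of U_t is random), so the condition |Σ| ≠ 0 of the general model holds only for the first component.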