Modified Likelihood Ratio Test for Sub-mean Vectors with Two-step Monotone Missing Data in Two-sample Problem

This article deals with the problem of testing two normal sub-mean vectors when the data set has two-step monotone missing observations. Under the assumption that the population covariance matrices are equal, we obtain the likelihood ratio test (LRT) statistic. Furthermore, an asymptotic expansion for the null distribution of the LRT statistic under two-step monotone missing data is derived by the perturbation method. Using this result, we propose two improved statistics with good chi-squared approximations. One is the modified LRT statistic with Bartlett correction, and the other is the modified LRT statistic using a modification coefficient obtained by linear interpolation. The accuracy of the approximations is investigated by Monte Carlo simulation. The proposed methods are illustrated using an example.


Introduction
Standard statistical methods have been developed for analyzing complete rectangular data sets; however, incomplete data sets are often encountered. In this study, we consider the problem of testing the equality of two normal mean vectors on a subvector when the data set has two-step monotone missing observations. In two-step monotone missing data, the observations can be arranged so that the missing entries, each indicated by "*", form a single block. That is, in each sample we have complete data on all $p$ variables for $N^{(i)}_1$ observations and data on only the first $p_1 + p_2$ variables for the remaining $N^{(i)}_2$ observations.
Many statistical methods have been developed for analyzing data with missing observations (see, e.g., Anderson (1957); Anderson and Olkin (1985); Bhargava (1962); Jinadasa and Tracy (1992); Little and Rubin (1986); Shutoh, Kusumi, Morinaga, Yamada, and Seo (2010); Srivastava and Carter (1986); Yu, Krishnamoorthy, and Pannala (2006)). As a previous study closely related to this one, Kanda and Fujikoshi (1998) discussed the distribution of the maximum likelihood estimators (MLEs) for two-step, three-step, and general $k$-step monotone missing data. For a two-step monotone missing pattern, Seko, Kawasaki, and Seo (2011) derived Hotelling's $T^2$ type statistic and the likelihood ratio test (LRT) statistic for testing two normal mean vectors, together with their approximate upper percentiles, and Kawasaki and Seo (2016a) derived the stochastic expansion of Hotelling's $T^2$ type statistic for large samples in the one-sample problem. Kawasaki, Shutoh, and Seo (2018) discussed the asymptotic distribution of the $T^2$ type statistic with two-step monotone missing data under a large-sample asymptotic framework.
Recently, a test for sub-mean vectors with two-step monotone missing data in the one-sample problem was discussed by Kawasaki and Seo (2016b). They derived the likelihood ratio (LR) criterion for testing the $(p_2 + p_3)$-dimensional mean subvector given the $p_1$-dimensional mean subvector. Subsequently, they proposed an approximation of the upper percentile of the LRT statistic using linear interpolation based on Rao's $U$ statistic for complete data sets. Naito T (2018) gave the $T^2$ type test statistic and simultaneous confidence intervals using the approximate upper percentiles of the $T^2$ type test statistic in one- and two-sample problems with tests for sub-mean vectors. Further, they considered simultaneous confidence intervals for pairwise multiple comparisons using Bonferroni's approximation in the $k$-sample problem. In this article, we extend the results for the one-sample problem given by Kawasaki and Seo (2016b) to the two-sample problem. In addition, a modified LRT statistic is given using the asymptotic expansion of the null distribution of the obtained LRT statistic. Moreover, we propose a modified LRT statistic using a modification coefficient obtained by linear interpolation. Extending these results to $k$-step monotone missing data remains an open problem, and the derivations become very complicated. In this article, we therefore restrict attention to two-step monotone missing data.
The rest of the article is organized as follows. In Section 2, we review the test for a subvector based on non-missing data when the first $p_1$ dimensions of the mean vector $\mu^{(i)}$ are equal. Then, we give the definitions and notation and derive the MLEs and the LRT statistic for two-step monotone missing data. In Section 3, an asymptotic expansion of the null distribution of the LRT statistic is derived, and we provide two modified LRT statistics. The accuracy of the approximations is investigated by Monte Carlo simulation in Section 4. The results of Section 3 are illustrated using a numerical example in Section 5. Finally, Section 6 concludes this article. The proof of Theorem 1 is given in the Appendix.
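
The LRT statistic

Non-missing data case
In this section, we discuss tests on a subvector in the two-sample case with non-missing data. Let $y^{(i)}_j$ ($j = 1, \ldots, N^{(i)}$, $i = 1, 2$) be $p$-dimensional observations from the $i$-th normal population with a common covariance matrix. The sample mean vectors and the two matrices of sums of squares and products, $W$ and $B$, are defined from these observations, and $W$ and $B$ are partitioned by rows into blocks $W_k : p_k \times p$ and $B_k : p_k \times p$, respectively. The likelihood ratio $\lambda$ for testing the equality of the remaining $p - p_1$ mean components, given that the first $p_1$ components are equal (hypothesis (1)), is expressed in terms of these matrices. Under the null hypothesis in (1), the LRT statistic $-2 \log \lambda$ is asymptotically distributed as $\chi^2$ with $p - p_1$ degrees of freedom when $N^{(i)} \to \infty$, $i = 1, 2$. The problems treated in this article concern constructing the modified LRT statistic based on two-step monotone missing data.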
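As a rough illustration of the complete-data test, the following Python sketch computes $-2 \log \lambda$ under the assumption that $\lambda$ takes the classical form $\lambda^{2/N} = |W|\,|W_{11} + B_{11}| / (|W + B|\,|W_{11}|)$, where $W$ and $B$ are taken to be the within-group and between-group matrices of sums of squares and products, $W_{11}$ and $B_{11}$ are their leading $p_1 \times p_1$ blocks, and $N = N^{(1)} + N^{(2)}$. This form, the function name `lrt_subvector_complete`, and the use of NumPy are assumptions made for illustration only.

```python
import numpy as np


def lrt_subvector_complete(y1, y2, p1):
    """-2 log(lambda) for equality of the last p - p1 mean components,
    given that the first p1 components are equal (complete data).

    Assumed form: lambda^(2/N) = |W| |W11 + B11| / (|W + B| |W11|).
    """
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    n1, n2 = len(y1), len(y2)
    n = n1 + n2
    m1, m2 = y1.mean(axis=0), y2.mean(axis=0)

    # Within-group sums of squares and products.
    w = (y1 - m1).T @ (y1 - m1) + (y2 - m2).T @ (y2 - m2)
    # Between-group sums of squares and products for two groups.
    d = m1 - m2
    b = (n1 * n2 / n) * np.outer(d, d)

    def logdet(a):
        # Log-determinant, numerically safer than log(det(a)).
        return np.linalg.slogdet(a)[1]

    w11, b11 = w[:p1, :p1], b[:p1, :p1]
    log_ratio = logdet(w) + logdet(w11 + b11) - logdet(w + b) - logdet(w11)
    return -n * log_ratio
```

Under these assumptions the returned value would be compared with the upper percentile of the $\chi^2$ distribution with $p - p_1$ degrees of freedom.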

Setting and problem
This section describes the missing data treated in this paper and the hypothesis testing problem under consideration. Let $x^{(i)}_1, \ldots, x^{(i)}_{N^{(i)}_1}$ be distributed as the multivariate normal $N_p(\mu^{(i)}, \Sigma)$, and let $x^{(i)}_{N^{(i)}_1 + 1}, \ldots, x^{(i)}_{N^{(i)}}$ be distributed as the multivariate normal $N_{p_1 + p_2}(\mu^{(i)}_{(12)}, \Sigma_{(12)(12)})$ for $i = 1, 2$, where $\mu^{(i)}_{(12)}$ consists of the first $p_1 + p_2$ components of $\mu^{(i)}$ and $\Sigma_{(12)(12)}$ is the corresponding leading block of $\Sigma$. We partition the $p$-dimensional vector as $x^{(i)}_j = (x^{(i)\prime}_{1j}, x^{(i)\prime}_{2j}, x^{(i)\prime}_{3j})'$, where $x^{(i)}_{1j}$, $x^{(i)}_{2j}$, and $x^{(i)}_{3j}$ are $p_1 \times 1$, $p_2 \times 1$, and $p_3 \times 1$ vectors, respectively. Similarly, we partition the $(p_1 + p_2)$-dimensional vector as $x^{(i)}_j = (x^{(i)\prime}_{1j}, x^{(i)\prime}_{2j})'$. That is, we have complete data for $N^{(i)}_1$ mutually independent observations with $p$ dimensions, and incomplete data for $N^{(i)}_2$ mutually independent observations with $(p_1 + p_2)$ dimensions, where $N^{(i)}_2 = N^{(i)} - N^{(i)}_1$ and $p = p_1 + p_2 + p_3$. Now, we consider the hypothesis
$$H_0 : \mu^{(1)}_{(23)} = \mu^{(2)}_{(23)} \ \text{given} \ \mu^{(1)}_1 = \mu^{(2)}_1 \quad \text{vs.} \quad H_1 : \mu^{(1)}_{(23)} \neq \mu^{(2)}_{(23)} \ \text{given} \ \mu^{(1)}_1 = \mu^{(2)}_1, \tag{2}$$
where $\mu^{(i)}_{(23)} = (\mu^{(i)\prime}_2, \mu^{(i)\prime}_3)'$. Situations in which some components of $\mu$ are known are not rare, and partial information concerning the population means may be available. Srivastava (2002) motivated this problem for non-missing data with a numerical example. Eaton and Kariya (1975) derived tests for the independence of two normally distributed subvectors when an additional random sample is available. Provost (1990) obtained explicit expressions for the MLEs of all parameters of a multinormal random vector and for the LRT statistic for testing independence between subvectors. We discuss the problem of hypothesis (2) for data sets with two-step monotone missing observations in the two-sample case. In this section, we derive the MLEs of $\mu^{(i)}$ and $\Sigma$, and the MLEs of $\mu \,(= \mu^{(1)} = \mu^{(2)})$ and $\Sigma$ under $H_0$. Using these MLEs, we propose the LRT statistic.
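To make the data layout concrete, the following Python sketch simulates one sample with a two-step monotone missing pattern, storing the unobserved last $p_3$ variables as NaN. The function names `make_two_step_monotone` and `split_parts`, the NaN encoding, and the choice $\Sigma = I_p$ are assumptions made for illustration.

```python
import numpy as np


def make_two_step_monotone(n1, n2, p1, p2, p3, mean=None, seed=0):
    """Simulate one sample with a two-step monotone missing pattern.

    Rows 0..n1-1 are complete; rows n1..n1+n2-1 have the last p3
    variables missing (stored as NaN). Sigma = I is assumed here.
    """
    p = p1 + p2 + p3
    rng = np.random.default_rng(seed)
    mu = np.zeros(p) if mean is None else np.asarray(mean, dtype=float)
    x = rng.standard_normal((n1 + n2, p)) + mu
    x[n1:, p1 + p2:] = np.nan  # the "*" block of the pattern
    return x


def split_parts(x, p3):
    """Return (complete rows, incomplete rows without the missing block)."""
    complete = ~np.isnan(x).any(axis=1)
    return x[complete], x[~complete, : x.shape[1] - p3]


# Sizes matching the numerical example: N1 = 8, N2 = 4, (p1, p2, p3) = (1, 3, 1).
sample = make_two_step_monotone(8, 4, 1, 3, 1)
complete_part, incomplete_part = split_parts(sample, p3=1)
```

Here `complete_part` holds the $N^{(i)}_1$ observations on all $p$ variables and `incomplete_part` holds the $N^{(i)}_2$ observations on the first $p_1 + p_2$ variables.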

MLEs and the LRT statistic
In this section, we consider the LRT statistic for (2). To derive the LRT statistic, we first consider the MLEs under the null hypothesis. Let the MLEs of $\mu^{(i)}$ and $\Sigma$ be denoted by $\hat{\mu}^{(i)}$ and $\hat{\Sigma}$, respectively, and be partitioned in the same manner as $\mu^{(i)}$ and $\Sigma$ ($i = 1, 2$). We assume that the observation vectors are distributed as $N_p(\mu^{(i)}, \Sigma)$ and that $N^{(i)}_1 > p$, which is a necessary and sufficient condition for the existence and uniqueness of the MLEs of $\mu^{(i)}$ and $\Sigma$. The likelihood function is the product of the $p$-variate normal densities of the complete observations and the $(p_1 + p_2)$-variate normal densities of the incomplete observations. The MLEs are obtained from the likelihood equations and are expressed in terms of the sample mean vectors and unbiased sample covariance matrices.
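The structure of this likelihood can be sketched in Python as follows. The code evaluates the log-likelihood implied by the model of Section 2 for given parameter values; it does not reproduce the closed-form MLEs. The function names `mvn_loglik` and `two_step_loglik` and the argument layout are assumptions made for illustration.

```python
import numpy as np


def mvn_loglik(x, mu, sigma):
    """Sum of multivariate normal log densities over the rows of x."""
    x = np.atleast_2d(x)
    n, d = x.shape
    r = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = np.einsum("ij,ij->", r @ np.linalg.inv(sigma), r)
    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet + quad)


def two_step_loglik(samples, mus, sigma, p3):
    """Log-likelihood for two samples with two-step monotone missing data.

    samples: list of (complete, incomplete) array pairs, one per group
    mus:     list of p-dimensional mean vectors, one per group
    sigma:   common p x p covariance matrix
    """
    q = sigma.shape[0] - p3  # q = p1 + p2 observed variables
    total = 0.0
    for (complete, incomplete), mu in zip(samples, mus):
        total += mvn_loglik(complete, mu, sigma)
        # Incomplete rows follow the marginal law of the first q variables.
        total += mvn_loglik(incomplete, mu[:q], sigma[:q, :q])
    return total
```

Maximizing this function with and without the restriction of $H_0$ would give the two maximized likelihoods whose ratio defines $\lambda$.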

Two approximate solutions
In this section, we propose two approximate modified LRT statistics. Under $H_0$, $-2 \log \lambda$ is asymptotically distributed as a $\chi^2$ distribution with $p_2 + p_3$ degrees of freedom as the sample sizes tend to infinity. Although the chi-squared approximation is very simple, it is not accurate when the sample size is not large. Therefore, we propose two correction coefficients for the LRT statistic to improve the approximation to the $\chi^2$ distribution. For the general theory of modified likelihood ratio statistics, see Muirhead (1982).

Modified LRT statistic
First, we consider the asymptotic expansion of the LRT statistic, $-2 \log \lambda$, for large sample sizes. In our derivations, we consider the stochastic expansions of the MLEs of the mean vectors and the covariance matrix, both without restriction and under $H_0$. We note that the stochastic expansions are derived under $\mu^{(1)} = \mu^{(2)} = 0$ and $\Sigma = I_p$. Under the above conditions, we have the following theorem.
Theorem 1. An asymptotic expansion of the distribution of the likelihood ratio test statistic, $-2 \log \lambda$, can be presented as
Proof. See the Appendix.
Using the asymptotic expansion of the null distribution of $-2 \log \lambda$, the Bartlett correction coefficient of the LRT statistic is given by $\rho = 1 - c/N$. Then, we can derive the modified LRT statistic, $-2\rho \log \lambda$.
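The correction itself is a simple rescaling, which the following Python sketch illustrates. The coefficient $c$ comes from Theorem 1 and is treated here as an input; the function name `bartlett_corrected` and the argument names are assumptions made for illustration.

```python
def bartlett_corrected(neg2_log_lambda, c, n):
    """Return (rho, -2 * rho * log(lambda)) with rho = 1 - c / n.

    neg2_log_lambda: observed value of -2 log(lambda)
    c:               correction constant (supplied by the user)
    n:               sample size N appearing in rho = 1 - c / N
    """
    rho = 1.0 - c / n
    return rho, rho * neg2_log_lambda


# Hypothetical inputs, for illustration only.
rho, modified = bartlett_corrected(neg2_log_lambda=10.0, c=2.0, n=20.0)
```

The modified value would then be referred to the $\chi^2$ distribution with $p_2 + p_3$ degrees of freedom in place of the uncorrected statistic.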

Modified statistic $Q^*$
The modified LRT statistic, $-2\rho \log \lambda$, given in Section 3.1 is a theoretical result; however, its expression is somewhat complicated. In addition, although we will investigate this in detail in Section 4, the approximation is not always accurate. In this section, we propose an approximate solution for the upper percentiles that is simpler than the modified LRT statistic.
The coefficient of the modified LRT statistic for non-missing data is obtained by setting $p_3 = 0$ and replacing $p_2$ with $p_2 + p_3$ in the coefficient $c$ given in Section 3.1. If we denote the coefficients of the modified LRT statistic for non-missing data with sample sizes $N$ and $N_1$ by $\rho_N$ and $\rho_{N_1}$, respectively, then it may be noted that $\rho^*$ lies between $\rho_N$ and $\rho_{N_1}$, where $\rho^*$ is the coefficient of the modified LRT statistic $-2\rho^* \log \lambda$. Using linear interpolation, we propose an approximation to the modified LRT statistic in the case of two-step monotone missing data. Then, we can obtain an approximate modified LRT statistic with two-step monotone missing data, $Q^* = -2\rho^* \log \lambda$. In the next section, we compare the accuracy of the two modified LRT statistics proposed in this section by simulation.
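The interpolation idea can be sketched in Python as a convex combination of $\rho_{N_1}$ and $\rho_N$. The interpolation weight used in this article is not reproduced here, so `weight` is a hypothetical parameter in $[0, 1]$; the function names and the form $Q^* = -2\rho^* \log \lambda$ are likewise assumptions made for illustration.

```python
def interpolate_rho(rho_n1, rho_n, weight):
    """Linear interpolation between rho_N1 (weight=1) and rho_N (weight=0).

    `weight` is a hypothetical interpolation weight, not the article's.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * rho_n1 + (1.0 - weight) * rho_n


def q_star(neg2_log_lambda, rho_n1, rho_n, weight):
    """Approximate modified statistic rho* times -2 log(lambda)."""
    return interpolate_rho(rho_n1, rho_n, weight) * neg2_log_lambda
```

Because the result is a convex combination, the interpolated coefficient always lies between $\rho_{N_1}$ and $\rho_N$, in line with the remark above.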
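We simulated the upper $100\alpha$ percentiles of the LRT statistic ($-2 \log \lambda$) and the modified LRT statistics ($-2\rho \log \lambda$ and $Q^*$). The actual type I error rates are the probabilities that $-2 \log \lambda$, $-2\rho \log \lambda$, and $Q^*$, respectively, exceed the upper $100\alpha$ percentile of the $\chi^2$ distribution with $p_2 + p_3$ degrees of freedom.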
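Given a vector of simulated values of any of these statistics, the two quantities can be estimated as in the following Python sketch. The function names are assumptions made for illustration, and the demonstration uses chi-squared draws only as a stand-in for simulated statistics.

```python
import numpy as np


def upper_percentile(stats, alpha):
    """Empirical upper 100*alpha percentile of simulated statistics."""
    return float(np.quantile(np.asarray(stats, dtype=float), 1.0 - alpha))


def actual_type1_error(stats, critical_value):
    """Proportion of simulated statistics exceeding a critical value."""
    return float(np.mean(np.asarray(stats, dtype=float) > critical_value))


# Stand-in data: draws from a chi-squared law with 4 degrees of freedom,
# used in place of simulated values of -2 log(lambda), -2 rho log(lambda), or Q*.
rng = np.random.default_rng(0)
draws = rng.chisquare(4, size=200_000)
simulated_point = upper_percentile(draws, alpha=0.01)
# 13.28 is the upper 1% point of chi-squared with 4 df quoted in Section 5.
type1 = actual_type1_error(draws, critical_value=13.28)
```

A statistic whose null distribution is well approximated by the $\chi^2$ distribution gives an estimated type I error rate close to the nominal $\alpha$.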
From the tables, it is seen that the simulated values approach the upper percentiles of the $\chi^2$ distribution as the sample sizes become large. A setting with a sample size of 200 is listed as a representative example of a large sample size, and it was confirmed that the same tendency was observed in other cases. It can also be seen that the upper percentiles of the two modified LRT statistics give better results in all cases. It appears from the simulated results that the upper percentiles of $-2 \log \lambda$ and $-2\rho \log \lambda$ approach $\chi^2_\alpha$ monotonically, but the upper percentile of $Q^*$ does not. While $-2\rho \log \lambda$ is obtained by asymptotic expansion, it is presumed that $Q^*$ is not monotonic because the ratio of $\rho_{N_1}$ and $\rho_N$ forming $\rho^*$ depends on the relationship between $p$ and $p_3$. In addition to the cases listed in this article, we are conducting simulations in various other cases, and we will also examine their trends. The results for the actual type I error rates also show that $-2\rho \log \lambda$ is a good approximation for small $p_3/p$ values, and $Q^*$ is a good approximation for large $p_3/p$ values. Note that the case where the value of $p_3/p$ is small refers to $(p_1, p_2, p_3) = (2, 2, 4)$, and the same tendency is observed in the other cases $(2, 4, 4)$ and $(4, 4, 8)$. Cases with large values of $p_3/p$ were $(p_1, p_2, p_3) = (2, 2, 2)$, $(4, 2, 2)$, and $(4, 4, 4)$, and the same tendency was confirmed for $(8, 2, 2)$ and $(8, 4, 4)$ as well.
The following figures show the distributions of the LRT statistic and the modified LRT statistic $-2\rho \log \lambda$, together with the $\chi^2$ distribution as the asymptotic distribution. The blue histogram represents the LRT statistic, the pink histogram represents the modified LRT statistic, and the red solid line represents the $\chi^2$ distribution. Figures 1 to 4 show the behavior of the distributions of the statistics when the dimension is $(p_1, p_2, p_3) = (2, 2, 2)$ and the sample sizes are varied. From these figures, we can see that the correction provided by the modified LRT statistic is accurate even when the sample sizes are small. Figures 5 and 6 show the case where the dimensions are varied and the sample sizes are $(N^{(i)}_1, N^{(i)}_2) = (5, 5)$, $i = 1, 2$. As shown in Figures 1 and 5, when the dimension of the missing part increases, the approximation to the $\chi^2$ distribution becomes worse; however, the modified LRT statistic shows better approximation accuracy than the LRT statistic.

Numerical example
We illustrate the results of this study using an example given in Wei and Lachin (1984). These data consist of serum cholesterol values measured under treatment at five time points (baseline and months 6, 12, 20, and 24) in the placebo and high-dose groups. The original data consist of 31 and 36 complete observations and 17 and 29 incomplete observations, respectively. In this article, we randomly selected 8 complete observations from each group. In addition, to create two-step monotone missing data, we used from the incomplete data of each group 4 observations for which the value at month 24 was not observed. In this example, the first variable, baseline, seems to be equal between the two groups, so we treat it as the "given" component in the hypothesis. Thus, we have two-step monotone missing data with $N^{(i)}_1 = 8$, $N^{(i)}_2 = 4$ and $p_1 = 1$, $p_2 = 3$, $p_3 = 1$ for $i = 1, 2$.
For the above example, we obtained $-2 \log \lambda = 17.46$ with a p-value of 0.00036. Since the simulated upper 1% point of $-2 \log \lambda$ is 18.13, the null hypothesis is not rejected at the significance level of 0.01. When we use $\chi^2_{4, 0.01} = 13.28$, the null hypothesis is rejected. Thus, using the $\chi^2$ percentile gives an incorrect test result. In contrast, we obtained $-2\rho \log \lambda = 13.23$ (p-value 0.00758) and $Q^* = 12.36$ (p-value 0.01325) from the above example. Since the simulated upper 1% points of $-2\rho \log \lambda$ and $Q^*$ are 13.74 and 12.84, respectively, the null hypothesis is not rejected in either case. When we use $\chi^2_{4, 0.01}$, the null hypothesis is also not rejected, and the test results are the same as when the simulated values are used. We also repeated the analysis 10 times on data obtained by the same sampling procedure and confirmed that similar results were obtained.
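The comparisons in this example can be tabulated with a short Python sketch. All numbers are the values quoted above; the dictionary layout and the function name `decisions` are assumptions made for illustration.

```python
# Upper 1% point of the chi-squared distribution with 4 df, as quoted in the text.
CHI2_CRIT = 13.28

# (observed value, simulated upper 1% point), all taken from the text.
RESULTS = {
    "-2 log lambda": (17.46, 18.13),
    "-2 rho log lambda": (13.23, 13.74),
    "Q*": (12.36, 12.84),
}


def decisions(results, chi2_crit):
    """Rejection decisions at the 1% level under both critical values."""
    out = {}
    for name, (observed, simulated_crit) in results.items():
        out[name] = {
            "reject_by_chi2": observed > chi2_crit,
            "reject_by_simulation": observed > simulated_crit,
        }
    return out


table = decisions(RESULTS, CHI2_CRIT)
```

Only the uncorrected statistic gives different decisions under the two critical values, which is the disagreement described above.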

Conclusion
We considered the two-sample problem of testing for sub-mean vectors with two-step monotone missing data. We derived the LRT statistic from the MLEs and provided an asymptotic expansion of its null distribution. Then, we proposed two modified LRT statistics. In addition, we reported the simulated upper $100\alpha$ percentiles of the LRT statistic and the modified LRT statistics, as well as the actual type I error rates when the null hypothesis is rejected using the upper percentile of the $\chi^2$ distribution with $p_2 + p_3$ degrees of freedom. We then illustrated the use of the proposed statistics with a numerical example. In conclusion, the simulation results showed that the modified LRT statistics provided a better approximation than the LRT statistic in all cases.
As an issue for the future, an extension to the general $m$-population problem can be considered. On the other hand, for $k$-step monotone missing data, even the one-population problem remains to be discussed.
Therefore, by inverting the characteristic function, the proof of Theorem 1 is complete.