Test for Linearity in Non-Parametric Regression Models

The problem of checking the linearity of a regression relationship is addressed. The test uses nonparametric estimation techniques. The null hypothesis that the regression function is linear is tested against nonspecific alternative hypotheses. The test is based on a Hermite transform characterization of conditional expectations. A test statistic is derived, and its distribution under the null hypothesis of linearity is determined. A power study using simulation shows the new statistic to be sensitive to non-linearity.


Introduction
Let (X, Y) be a pair of real-valued random variables. In many situations, the linear model is insufficient to explain the relationship between the response variable Y and its associated covariate X. A natural generalization is to model the mean nonparametrically in the covariates. Suppose (X_i, Y_i), i = 1, ..., n, are independent random variables, identically distributed as (X, Y), where Y is the response variable and X the covariate. Consider the following nonparametric regression model, where ϕ is assumed to be unknown:

Y_i = ϕ(X_i) + ε_i,  i = 1, ..., n.  (1)

The function ϕ : R → R defined by ϕ(x) = E(Y | X = x) is called the regression function of Y on X. The set of values (X_1, ..., X_n) is called the design. The random design setting stands in contrast to the fixed design setting, where the covariates X_1, ..., X_n are fixed (non-random) and only the responses Y_1, ..., Y_n are treated as random.
The covariance structure of the design points is completely known and need not be estimated. The residuals ε_i are i.i.d. random variables with E(ε_i) = 0 and var(ε_i) = σ².
Nonparametric regression analysis relaxes the assumption of linearity, substituting the assumption of a smooth regression function. The cost of relaxing the assumption of linearity is much greater computation than in the case of ordinary least squares estimation and, in some instances, a more difficult-to-understand result. A variety of methods are available to estimate ϕ, based on non parametric regression models. These methods have been proposed to make the specification of the conditional mean function as flexible as possible. The standard approaches include splines, wavelets, moving averages, running medians, local polynomials, regression trees, neural networks, and other methods like the kernel regression estimators.
Testing the unknown regression function appearing in a nonlinear model has received much less attention in the statistical literature than the estimation of this nonparametric regression function. The most closely related article in the literature is the test developed by Mohdeb and Mokkadem (1998), based on Fourier coefficients. They consider a nonparametric regression with regular and deterministic design; in other words, they assume that X_i = i/n, that ϕ is a function from [0, 1] to R, and that the observations Y_i are given by

Y_i = ϕ(i/n) + ε_i,

where the ε_i are i.i.d. with mean zero and variance σ². They obtain the asymptotic behavior of their proposed test, that is, the level and the asymptotic power of the test. Pearson (1905) developed a test for linearity of regression expressed in terms of the correlation coefficient r and the correlation ratio η. The correlation ratio is a coefficient of non-linear association. In the case of a linear relationship, the correlation ratio η reduces to the correlation coefficient. In the case of a non-linear relationship, the correlation ratio is larger, so the difference between the correlation ratio and the correlation coefficient measures the degree of non-linearity of the relationship. The likelihood ratio test statistic is considered by Gallant (1975) for the hypothesis H : θ = θ_0 against A : θ ≠ θ_0, using the nonlinear regression model Y = ϕ(X, θ) + ε with normal errors and unknown variance. A test of normality based on Hermite polynomials was proposed by Bontemps and Meddahi (2005).
Tests for both the linearity hypothesis and the fit of a regression model have been proposed in several works, and their power and consistency have frequently been proven. The literature on testing against a linear regression model is very extensive and still growing; we refer to González-Manteiga and Crujeiras (2013), who provide an excellent summary of existing procedures. Among these tests, Weihrather (1993) proposed a method for testing the goodness of fit of a linear regression model. The test statistic is based on a distance between the fit of the linear model and the fit of the nonparametric model; the finite-sample properties are studied by way of a simulation experiment with respect to the power of the test against special alternatives. The Härdle and Mammen (1993) test assumes the nonparametric model Y = m(X) + e, where the only available information is provided through the sample {(X_1, Y_1), ..., (X_n, Y_n)}. The term e stands for a random error with zero mean. Their goal is to test the hypothesis H_0 : m(.) linear, using a measurement of the gap between the parametric and nonparametric fits; the test is based on the integrated squared difference between the parametric and nonparametric fits. It is worth noting that the power of the test was not studied in their work. We also mention the work of Stute and Manteiga (1996), which proposed statistics based on a distance between nonparametric and parametric estimates; estimation is carried out by a minimum-distance criterion instead of maximum likelihood. Eubank and Spiegelman (1990) obtained a test by fitting a smoothing spline to the residuals of a linear regression. They investigated the use of nonparametric regression procedures to test the adequacy of a parametric linear model.
The authors consider a model in which the dependent variable is the sum of a linear part and an unknown function l evaluated at known design points (x_i, l(x_i)), up to an error term (written as y = a^t x + l(x) + e).
The function l is assumed to belong to a general class of functions. The tests are built from nonparametric regression fits to the residuals of the linear regression. Simulation experiments involving a test based on fitting cubic smoothing splines to the residuals reveal that this test has good power properties against several alternatives. However, their test is limited by the assumption of normality on the error term. Azzalini and Bowman (1993) examined the problem of verifying the linearity of a regression relationship through the idea of smoothing a residual plot. The authors apply the pseudo-likelihood ratio approach in the context of linear regression. The true regression function is estimated by nonparametric smoothing and then compared with a fitted parametric model; any deviation is assessed by a pseudo-likelihood ratio test. A power study shows that their statistic is more sensitive to non-linearity than the Durbin-Watson statistic. Bierens (1982) introduced two consistent model specification tests. The first one is simple but rather coarse; the second is a more involved and laborious test. These tests are based on a characterization of conditional expectations by the Fourier transform. The author used a family of exponential functions to generate an infinite number of moment conditions that are necessary for the consistency of the conditional moment test. However, calculating the test statistic requires computing a maximum over an infinite set, which can impose a significant computational load in practice. To overcome this problem, the author proposed to draw randomly a sequence of elements of the infinite set and calculate the maximum over them. Zheng (1996) proposed a test that combines the idea of the conditional moment test with the methodology of nonparametric estimation.
The author used the kernel method to construct a moment condition which can be used to distinguish between the null and alternative hypotheses. The test has an advantage over tests based on measuring the distance between parametric and nonparametric models: it imposes very few regularity conditions beyond those generally required for nonlinear least squares. Most tests have the inconvenience of being inconsistent against deviations from the parametric model, general alternatives, or infinite-dimensional alternatives.
The main purpose of our paper is to provide a new test for linearity in the regression model with random design. The approach tests the linearity of ϕ from the data (X_1, Y_1), ..., (X_n, Y_n), without estimating it. The statistic is based on the Hermite transform. We proceed as in Djeddour, Mokkadem, and Pelletier (2007), adapting their approach to the specificities of our setting.
The outline of this paper is as follows. Section 2 introduces the test construction. In Section 3 the main results are presented. In Section 4 we check the accuracy of the test on simulated data. Section 5 draws a conclusion. Section 6 is an appendix providing the main mathematical proofs, and Section 7 is an appendix giving some definitions and properties of Hermite polynomials.

Hermite polynomials
We use here a family of orthogonal polynomials on the real line. Generally a polynomial p(x) is written in terms of the monomials x^j; this is known as the natural form of the polynomial. The trouble with the natural form is that the monomials are very highly correlated. The idea behind orthogonal polynomials is to select the basis polynomials p_j(x) to be as different from each other as possible: two polynomials p_i and p_j are said to be orthogonal if p_i(X) and p_j(X) are uncorrelated as X varies over some distribution.
There are many ways to approximate functions; polynomial approximation, however, is relatively straightforward for many purposes. The theorem of Weierstrass (Queffélec and Zuily 2007) states that a function continuous on a finite closed interval can be approximated with any preassigned accuracy by polynomials.
Remark 1. The Hermite polynomials are thus orthogonal with respect to the standard normal probability density function f(x) = (1/√(2π)) e^(−x²/2), with mean 0 and variance 1.
The Hermite transform has drawn significant attention, since it exhibits some important properties and high suitability for several applications in different research fields, e.g. in astrophysics (Leonis 1980; Öztürk and Gülsu 2014).
Definition 2. The Hermite polynomials form a sequence of polynomials, defined by

H_n(x) = (−1)^n e^(x²/2) (d^n/dx^n) e^(−x²/2),  n = 0, 1, 2, ...

One of the remarkable properties of the polynomials H_n(x) is that the derivative of each of them equals the preceding polynomial multiplied by a constant factor, i.e.

H_n'(x) = n H_{n−1}(x).

The other is a recurrence relation linking three consecutive polynomials:

H_{n+1}(x) = x H_n(x) − n H_{n−1}(x).

Together these two relations are characteristic of the Hermite polynomials; the only sequence of polynomials satisfying both equations is the sequence of Hermite polynomials.
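As an illustration, the two relations above can be checked numerically. The short Python sketch below (illustrative only, pure standard library; the function name `hermite` is ours) evaluates the probabilists' Hermite polynomials through the three-term recurrence and verifies the derivative relation H_n'(x) = n H_{n−1}(x) by finite differences.

```python
def hermite(n, x):
    """Probabilists' Hermite polynomial He_n(x) via the three-term
    recurrence He_{n+1}(x) = x*He_n(x) - n*He_{n-1}(x)."""
    h_prev, h = 1.0, x          # He_0(x) = 1, He_1(x) = x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, x * h - k * h_prev
    return h

# Check the derivative relation He_n'(x) = n * He_{n-1}(x) numerically.
x, eps = 0.7, 1e-6
for n in range(1, 6):
    deriv = (hermite(n, x + eps) - hermite(n, x - eps)) / (2 * eps)
    assert abs(deriv - n * hermite(n - 1, x)) < 1e-4
```

The recurrence is preferable to the natural (monomial) form both numerically and conceptually, as discussed above.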
Approximation of functions by polynomials is basic to a great many numerical techniques, and most numerical analysis texts include a treatment of polynomial approximation. Polynomial approximation serves many purposes in statistics; one of them is to model a nonlinear relationship between a response variable and an explanatory variable, as we will see in the sequel. Recall that if E(ϕ(X)) = 0 and E(ϕ(X)²) < ∞ for X ∼ N(0, 1), then ϕ(X) can be expanded in Hermite polynomials, that is,

ϕ(X) = Σ_{k≥1} (c_k / k!) H_k(X),  with c_k = E(ϕ(X) H_k(X)).  (3)

Observe that the expansion (3) starts at k = 1, since c_0 = E(ϕ(X)) = 0. Denote by k_0 ≥ 1 the Hermite rank of ϕ, namely the index of the first non-zero coefficient in the expansion (3). Formally, k_0 = min{k ≥ 1 : c_k ≠ 0}.
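The Hermite rank can be illustrated by Monte Carlo. The sketch below (an illustration only, under the normalization c_k = E(ϕ(X)H_k(X)) assumed here) estimates the first coefficients of the centred function ϕ(x) = x² − 1, whose Hermite rank is k_0 = 2, since c_1 = 0 while c_2 = E(He_2(X)²) = 2! = 2.

```python
import random

def hermite(n, x):
    """Probabilists' Hermite polynomial He_n(x) by recurrence."""
    h_prev, h = 1.0, x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, x * h - k * h_prev
    return h

random.seed(0)
n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

phi = lambda x: x * x - 1.0   # centred: E[phi(X)] = 0

# Monte Carlo estimates of c_k = E[phi(X) He_k(X)], k = 1, 2, 3
c = {k: sum(phi(x) * hermite(k, x) for x in xs) / n for k in (1, 2, 3)}
# Expect c_1 ≈ 0 and c_3 ≈ 0 while c_2 ≈ 2, so the Hermite rank is k0 = 2.
```

The first non-zero estimated coefficient identifies k_0 in practice, up to sampling noise.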
The Hermite transform is an integral transform which uses the Hermite polynomials H_n(x) as kernels. The Hermite transform of a function ϕ(x) is

H{ϕ}(k) = ∫_R ϕ(x) H_k(x) (1/√(2π)) e^(−x²/2) dx = E(ϕ(X) H_k(X)),  X ∼ N(0, 1).

The study can be generalized to the case of p explanatory variables by using multivariate Hermite polynomials H_α, α ∈ Z_{≥0}^p. Because the collection of all Hermite polynomials H_k is an orthonormal basis for the Hilbert space L²(γ_1), the collection of all Hermite polynomials H_α is an orthonormal basis for the Hilbert space L²(γ_p), where γ_p is the standard Gaussian measure on R^p, with mean 0 ∈ R^p and identity covariance operator. The disadvantage is that the moment formulas for multivariate Hermite polynomials become very painful to handle in this case.

Test construction
In the following, we consider regression with a single explanatory variable; this allows us to avoid heavy calculations which would not enhance the subject. Let X be a real-valued random variable with probability density f on R, assumed to exist and defined by f(x) = (1/√(2π)) e^(−x²/2) (see Remark 1). The problem is to construct a test of the hypothesis

H_0 : ϕ is linear  against the alternative  H_1 : ϕ is non-linear.

Let H_k(x) be the Hermite polynomials and c_k(ϕ) the Hermite coefficients of ϕ, defined by

c_k(ϕ) = E(ϕ(X) H_k(X)) = ∫_R ϕ(x) H_k(x) f(x) dx,  (4)

where ϕ is the true but unknown regression function. We assume that fϕ ∈ L²(R), where L²(R), the space of square-integrable functions with respect to the Lebesgue measure on the real line, is a natural domain on which to define the Hermite transform and Hermite series.
The procedure we consider is the following. We are interested in testing for non-linearity in regression models: under H_0, the process y_i is linear in mean conditionally on x, that is,

H_0 : P(E(Y | X = x) = θ^t x for some θ) = 1,

with x^t denoting the transpose of x. The alternative of interest is the negation of the null, that is,

H_1 : P(E(Y | X = x) = θ^t x) < 1 for every θ.

In regression problems, the mean relationship, as the first-order quantity, is generally of much more interest than higher-order properties such as constant variance or normality.
If ϕ is linear, we write it as

ϕ(x) = αx + β.

We compute the Hermite coefficients of ϕ according to (4); this gives

c_k(ϕ) = α E(X H_k(X)) + β E(H_k(X)),

which can be written as c_1 = α, c_0 = β, and c_k = 0 for every k ≥ 2. Testing that ϕ is linear is therefore equivalent to testing whether the c_k, k ≥ 2, are all zero. In this setting, we test the null hypothesis

H_0 : c_k = 0 for all k ≥ 2.  (6)

Let

ĉ_k = (1/n) Σ_{j=1}^n H_k(X_j) Y_j  (7)

be the empirical estimators of the c_k, and let m = m(n) be a sequence such that m(n) → ∞ as n → ∞ (this allows us to consider only a batch of the c_k in (6), e.g. m = n/10, m = n/2 or m = n). To test (6), we construct the statistic

T_m,n = n Σ_{k=2}^m ĉ_k² / k!.  (8)

We reject H_0 when T_m,n is large.
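For illustration, a minimal Python sketch of the empirical coefficients (7) and of an unstandardized statistic in the spirit of (8) follows. The diagonal normalization by k! is a simplifying assumption made here (the standardized version below weights by the full covariance matrix Σ), and the names `hermite_coeffs` and `t_stat` are ours.

```python
import math
import random

def hermite(n, x):
    """Probabilists' Hermite polynomial He_n(x) by recurrence."""
    h_prev, h = 1.0, x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, x * h - k * h_prev
    return h

def hermite_coeffs(xs, ys, m):
    """Empirical coefficients c_k = (1/n) sum_j He_k(X_j) Y_j, k = 2..m."""
    n = len(xs)
    return [sum(hermite(k, x) * y for x, y in zip(xs, ys)) / n
            for k in range(2, m + 1)]

def t_stat(xs, ys, m):
    """Illustrative statistic n * sum_{k=2}^m c_k^2 / k! (simplified,
    diagonal normalization; not the Sigma-standardized version)."""
    n = len(xs)
    coeffs = hermite_coeffs(xs, ys, m)
    return n * sum(c * c / math.factorial(k)
                   for k, c in zip(range(2, m + 1), coeffs))

random.seed(1)
n, alpha, m = 5_000, 0.8, 5
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [alpha * x + random.gauss(0.0, 1.0) for x in xs]  # H0 true: linear mean
T = t_stat(xs, ys, m)   # stays bounded in probability under H0
```

Under the alternative, some ĉ_k (k ≥ 2) converges to a non-zero c_k, so T grows linearly in n and the test rejects.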

Main results
To simplify matters and without loss of generality, we assume σ to be known and equal to 1. If σ ≠ 1 but known, the conclusions are unchanged and the calculations merely become heavier: it suffices to replace E(ε_i²) = 1 by σ². If σ is unknown, it must be estimated and its distribution taken into account in order to derive the law of the test statistic; this case is not addressed in this work.
We first consider the regression of a response variable on a single covariate, with observed values y = (y_1, ..., y_n) and x = (x_1, ..., x_n) respectively, expressed in the model (1), where ϕ is assumed to be unknown and the ε_i are independent random variables with mean 0 and standard deviation 1. Our aim is to assess whether model (1) can be reduced to the simple linear form y_i = α + βx_i + ε_i (i = 1, ..., n), with least squares estimators of the intercept and slope parameters.
Under H_0 we can write Y_j = αX_j + ε_j. The empirical Hermite coefficients are defined as in (7); under H_0, they decompose as

ĉ_k = (α/n) Σ_{j=1}^n H_k(X_j) X_j + (1/n) Σ_{j=1}^n H_k(X_j) ε_j =: ĉ_k1 + ĉ_k2.

We prove the following theorems.
Proposition 5. At fixed k, √n ĉ_k1 converges in distribution to the centered Gaussian law N(0, α²(2k + 1)k!).

Proposition 6. The term ĉ_k2 = (1/n) Σ_{j=1}^n H_k(X_j) ε_j converges to 0 in probability.
By combining the results of Proposition 5 and Proposition 6, we conclude the asymptotic normality of ĉ_k (at fixed k).
The variance matrix Σ of Ĉ_m = (ĉ_2, ..., ĉ_m)^t is positive definite, because there is no almost-sure affine relation between the components of the random vector. Since the ĉ_k are correlated with each other, instead of taking T_m,n (defined by (8)) as a test statistic, one can use the standardized statistic

T_n = Ĉ_m^t Σ^{-1} Ĉ_m.

From what precedes, we can state the following theorem: under H_0, T_n converges in distribution to the χ²(m − 1) law.
Proof. The following proof characterizes the behavior of our statistic under the null hypothesis. Let Ĉ_m = (ĉ_2, ..., ĉ_m)^t. Concerning the asymptotic law of Ĉ_m: let x_2, ..., x_m be fixed reals and set V = (x_2, ..., x_m)^t. It must be shown that V^t Ĉ_m = Σ_{k=2}^m x_k ĉ_k converges in distribution to N(0, V^t Σ V). We then deduce that Ĉ_m converges in law to N(0, Σ), whence T_n = Ĉ_m^t Σ^{-1} Ĉ_m converges in law to χ²(m − 1). Martingale theory is used to obtain a central limit theorem for the Ĉ_m statistics: setting ξ_{m,k} = x_k ĉ_k, k = 2, ..., m, we obtain a triangular array whose row sums are Σ_{k=2}^m ξ_{m,k}. The martingale central limit theorem then ensures that Σ_{k=2}^m x_k ĉ_k converges in law to N(0, V^t Σ V). We deduce that Ĉ_m converges in distribution to N_{m−1}(0, Σ).
Recall that a sequence of random vectors (U_n)_n of R^{m−1} converges in law to a random vector U if and only if x^t U_n converges in law to x^t U for all x ∈ R^{m−1} (from the characterization of convergence in law by characteristic functions). To use the test statistic T_m,n in practice, we compute it from the sample and compare it with the quantile of its distribution under H_0. Rejection of the null hypothesis generally leads to the belief that a non-linear trend exists; despite its mathematical convenience, there is no special reason to believe that a simple linear trend function is suitable for modelling a complex system. According to the central limit theorem, when m is large (m > 100), the law of a χ² variable, being a sum of independent random variables, can be approximated by a normal law with mean (m − 1) and variance 2(m − 1); under the null hypothesis H_0, the standardized statistic (T_m,n − (m − 1))/√(2(m − 1)) is then approximately standard normal for large m. Under the alternative hypothesis H_1, where some Hermite coefficients c_k ≠ 0, T_m,n may for large m be approximately normal with a variance which is a very complicated function. Despite the seemingly usual setting, the answer to this question is highly non-trivial.
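In practice the χ²(m − 1) quantile used for the decision rule can be obtained without tables. A common device (an aside, not part of the procedure above) is the Wilson-Hilferty cube-root approximation, shown below alongside the normal approximation for large m mentioned in the text.

```python
import math

Z95 = 1.6448536  # standard normal 95% quantile

def chi2_quantile_95(df):
    """Wilson-Hilferty approximation to the chi-square 95% quantile:
    q ≈ df * (1 - 2/(9 df) + z * sqrt(2/(9 df)))**3."""
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + Z95 * math.sqrt(a)) ** 3

def normal_approx_95(df):
    """For large df the chi-square law is close to N(df, 2 df)."""
    return df + Z95 * math.sqrt(2.0 * df)

# Decision rule: reject H0 at the 5% level when T > chi2_quantile_95(m - 1).
```

For moderate degrees of freedom the Wilson-Hilferty value is accurate to a few hundredths; the plain normal approximation is cruder and only reasonable for large m.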

Case studies
A necessary condition for an effective analysis of statistical data is that statistical models summarize the data with precision. Nonparametric regression does not specify the form of the regression function before examining the data: theory might suggest that y depends on x, but it is unlikely to tell us that the relationship is linear.
This section presents some simulation experiments performed to assess the usefulness and accuracy of the results obtained in Section 3. First we suppose that y_i = αx_i with α = 0.8; we generate a random variable X according to the Gaussian distribution N(0, 1) and obtain (y_1, y_2, ..., y_n). In order to verify that the ĉ_k are Gaussian, we plot the histogram of a single coefficient, namely ĉ_3 = (1/n) Σ_{j=1}^n H_3(X_j) Y_j, which follows a normal distribution; with a sample of size n = 40000, this is shown in Figure 1. We present some results in order to give a brief description of the test performance under H_0. The test statistic T_m,n is calculated on simulated data sets with different sizes n and m and for different values of α. Let

y = αx.  (14)

We follow the steps below:
1. we generate a random variable X which follows an N(0, 1) law and obtain (y_1, y_2, ..., y_n) for fixed α;
2. we calculate the Hermite polynomials recursively;
3. we evaluate Ĉ_m.
Table 1 reports the test statistics calculated for different values of α, m and n, at the 5% critical value.
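The three steps above can be sketched as follows (an illustrative implementation; the noise-free model (14) is as in the text, while the function name and the particular n and m are our choices). Under H_0 every estimated coefficient ĉ_k, k ≥ 2, should be close to 0.

```python
import random

def hermite_matrix(xs, m):
    """Evaluate He_0, ..., He_m at every sample point with one pass of
    the three-term recurrence (step 2 of the procedure)."""
    rows = []
    for x in xs:
        h = [1.0, x]
        for k in range(1, m):
            h.append(x * h[-1] - k * h[-2])
        rows.append(h[:m + 1])
    return rows

random.seed(2)
n, m, alpha = 20_000, 4, 0.8
xs = [random.gauss(0.0, 1.0) for _ in range(n)]       # step 1: X ~ N(0, 1)
ys = [alpha * x for x in xs]                          # model (14): y = alpha x
H = hermite_matrix(xs, m)                             # step 2: recursion
C = [sum(H[j][k] * ys[j] for j in range(n)) / n       # step 3: evaluate C_m
     for k in range(2, m + 1)]
# Under H0 every coefficient in C should be near 0.
```

Evaluating all polynomials in one recurrence pass avoids recomputing lower orders for each k.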
For this we run a simulation that follows the steps below: 1. we generate a random variable X which follows an N(0, 1) law and obtain (y_1, y_2, ..., y_n) for fixed α, β;
2. we calculate the Hermite polynomials recursively;

3. we evaluate Ĉ_m.
Several values of the statistic T m,n are computed for different m, n and α.

Simulation study
In this section, we demonstrate the performance of our test on some numerical examples. The first set of examples concerns the linear model. We applied the test in the form described in Section 3. Moreover, in order to see the sensitivity of the results to the choice of n and m, we applied the test several times, for different values of n and m; the results are presented in Table 3. The test is evaluated with replications: the simulations are repeated 500 times for each setting, and the average T_m,n at each setting is compared with the chi-square critical value. The test statistic T_m,n is calculated on data sets with different sizes n and m and for different values of α according to (14). In Table 2, we notice that T_m,n < χ²_(m−1, 0.05), so we accept H_0 at the 5% level, i.e. ϕ is linear, as expected. In Table 3 we see that the test statistic is rather sensitive to the choice of m.
Some additional numerical work has been carried out. The second set of examples concerns the non-linear model. The test is evaluated with replications: the simulations are repeated 500 times for each setting, and the average T_m,n at each setting is compared with the chi-square critical value. The test statistic T_m,n is calculated on data sets with different sizes n and m (m depending on the value of n) and for different values of α according to (15). In Table 2, we notice that often T_m,n > χ²_(m−1, 0.05), so we reject H_0 at the 5% level, i.e. ϕ is non-linear, as expected.
The test was evaluated in terms of power. For the power performance, we generate data from alternatives under H_1 and calculate the proportion of times the null hypothesis is rejected; for the Type I error, we follow the same approach but generate the data from the null hypothesis. We call empirical size the percentage of false rejections of the null hypothesis H_0; the empirical power is the percentage of rejections of H_0 when we deliberately choose a false model. The results in Table 5 numerically confirm the announced results. We notice that the test is moderately powerful, while the Type I error is very often small. We deduce that the consistency of our procedure is numerically convincing, in accordance with what we expected.
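The size and power computations described above can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: each coefficient is standardized by its own variance k!((2k + 1)α² + 1)/n from the proof of Proposition 4, the correlation between coefficients is ignored (so the chi-square calibration is only approximate), and the sample sizes and number of replications are smaller than in the text.

```python
import math
import random

def hermite(n, x):
    """Probabilists' Hermite polynomial He_n(x) by recurrence."""
    h_prev, h = 1.0, x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, x * h - k * h_prev
    return h

def t_stat(xs, ys, m, alpha):
    """Each n*c_k^2 scaled by k!((2k+1)alpha^2 + 1): roughly chi-square
    with m - 1 df under H0 if the c_k were uncorrelated (approximation)."""
    n = len(xs)
    total = 0.0
    for k in range(2, m + 1):
        c = sum(hermite(k, x) * y for x, y in zip(xs, ys)) / n
        total += n * c * c / (math.factorial(k) * ((2 * k + 1) * alpha ** 2 + 1))
    return total

def reject_rate(signal, alpha, n=300, m=4, reps=200):
    """Fraction of replications with T above the 5% critical value:
    empirical size when `signal` is linear, empirical power otherwise."""
    crit = 7.815  # chi-square 95% quantile, m - 1 = 3 degrees of freedom
    hits = 0
    for _ in range(reps):
        xs = [random.gauss(0.0, 1.0) for _ in range(n)]
        ys = [signal(x) + random.gauss(0.0, 1.0) for x in xs]
        if t_stat(xs, ys, m, alpha) > crit:
            hits += 1
    return hits / reps

random.seed(3)
size = reject_rate(lambda x: 0.8 * x, alpha=0.8)   # H0 true: expect roughly 5%
power = reject_rate(lambda x: x * x, alpha=0.0)    # H1: expect near 100%
```

For the quadratic alternative, ĉ_2 converges to a non-zero limit, so the statistic grows like n and rejection becomes near-certain.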

Conclusion
In this work, we developed a linearity test for a regression model using Hermite polynomials, within a nonparametric regression methodology with random design. We have considered simple linear regression, but the procedure also applies to multiple regression: the results can be extended to models with several predictor variables having a general linear form, although this complicates the analysis and the interpretation.
The use of Hermite polynomials relies on a Gaussian density for the random explanatory variable X. The method does not work for categorical predictors, because qualitative variables cannot have a continuous Gaussian density.
Extensions to other designs should be possible using other families of orthogonal polynomials and will be explored in future research: Legendre polynomials when X is uniform over (−1, 1), Chebyshev polynomials when X has a Beta(1/2, 1/2) distribution over (−1, 1), or Laguerre polynomials when X has a gamma distribution on (0, ∞).
The proposed test was found to have reasonable properties in a simulation study. A small power study was carried out to assess the performance of the test, and some properties of the proposed test were discussed. Applications to simulated data suggested that the proposed test can improve the estimation of the regression function.

Proof of Proposition 4
To describe the technical development, write ĉ_k = (1/n) Σ_{j=1}^n H_k(X_j) Y_j; in other words, under H_0 we have Y_j = αX_j + ε_j and

E(ĉ_k²) = (1/n²) Σ_{j=1}^n E[H_k²(X_j)(αX_j + ε_j)²] + (1/n²) Σ_{j≠j'} E[H_k(X_j)H_k(X_{j'})(αX_j + ε_j)(αX_{j'} + ε_{j'})].  (18)

To handle this expression, there are two parts.

The first term of (18): knowing that Eε_j = 0 and Eε_j² = 1, we calculate

E[H_k²(X_j)(α²X_j² + 2αε_jX_j + ε_j²)] = α² E[X_j² H_k²(X_j)] + E[H_k²(X_j)].

We have E[H_k²(X_j)] = k!, and according to formula 4.23 in Declercq (1998), E[X_j² H_k²(X_j)] = s_k k!, where s_k = 2k + 1. We deduce that the first term of (18) equals (k!/n)((2k + 1)α² + 1).

The second term of (18) is (1/n²) Σ_{j≠j'} E[H_k(X_j)H_k(X_{j'})(ε_jε_{j'} + αX_jε_{j'} + αX_{j'}ε_j + α²X_jX_{j'})]. Using the independence of X_j and X_{j'} for j ≠ j', together with Eε_j = 0 and the independence of the errors, this expectation reduces to α² E[X_jH_k(X_j)] E[X_{j'}H_k(X_{j'})] = 0 for k ≥ 2. Combining the two terms of (18), we obtain

var(ĉ_k) = (k!/n)((2k + 1)α² + 1).
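The moment formula E(X² H_k²(X)) = (2k + 1)k! used above can be verified exactly, since even Gaussian moments are available in closed form (E X^p = (p − 1)!!). The sketch below (illustrative, pure standard library) expands He_k via the recurrence and takes the expectation term by term.

```python
import math

def he_coeffs(n):
    """Coefficient list (ascending powers of x) of He_n via the recurrence
    He_{n+1} = x*He_n - n*He_{n-1}."""
    prev, cur = [1.0], [0.0, 1.0]
    if n == 0:
        return prev
    for k in range(1, n):
        nxt = [0.0] + cur                 # multiply He_k by x
        for i, c in enumerate(prev):
            nxt[i] -= k * c               # subtract k * He_{k-1}
        prev, cur = cur, nxt
    return cur

def gauss_moment(p):
    """E[X^p] for X ~ N(0, 1): 0 for odd p, (p - 1)!! for even p."""
    if p % 2:
        return 0.0
    return float(math.prod(range(1, p, 2))) if p else 1.0

def expect_x2_hk2(k):
    """Exact E[X^2 He_k(X)^2], computed coefficient by coefficient."""
    h = he_coeffs(k)
    total = 0.0
    for i, a in enumerate(h):
        for j, b in enumerate(h):
            total += a * b * gauss_moment(i + j + 2)
    return total

# Consistency check of the formula E[X^2 He_k(X)^2] = (2k + 1) k!.
for k in range(1, 7):
    assert abs(expect_x2_hk2(k) - (2 * k + 1) * math.factorial(k)) < 1e-6
```

The same machinery also confirms E[He_k²(X)] = k! if the factor x² is dropped.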
• For k ≤ l we complete the variance matrix by symmetry. b. Second term of (22), for j ≠ j':

E[H_k(X_j)H_l(X_{j'})(ε_jε_{j'} + αX_jε_{j'} + αX_{j'}ε_j + α²X_jX_{j'})]
= Eε_jEε_{j'} E[H_k(X_j)H_l(X_{j'})] + αEε_{j'} E[X_jH_k(X_j)H_l(X_{j'})] + αEε_j E[X_{j'}H_k(X_j)H_l(X_{j'})] + α² E[X_jX_{j'}H_k(X_j)H_l(X_{j'})]
= α² E[X_jH_k(X_j)] E[X_{j'}H_l(X_{j'})] = 0.

Proof of Theorem 5
We use the following theorem (the classical central limit theorem): let (X_i)_i be a sequence of independent, identically distributed, square-integrable, non-constant random variables; set µ = EX_1, σ² = var X_1 with σ > 0, and S_n = Σ_{j=1}^n X_j. Then (S_n − nµ)/(σ√n) converges in distribution to N(0, 1). Set Z_j = X_jH_k(X_j); the Z_j are independent because the X_j are. We have ĉ_k1 = (α/n) Σ_{j=1}^n Z_j. The expectation and variance of ĉ_k1 are calculated as follows:
• E ĉ_k1 = α EZ_1 = 0; indeed µ = EZ_1 = E(X_jH_k(X_j)) = E(H_1(X_j)H_k(X_j)) = 0 for k ≥ 2;
• var(ĉ_k1) = (α²/n) var Z_1, with var Z_1 = E(H_1²(X_j)H_k²(X_j)) = E(X_j²H_k²(X_j)) = (2k + 1)k!, by the moment formula used in the proof of Proposition 4.
So we deduce that ĉ_k1 is asymptotically N(0, α²(2k + 1)k!/n) at fixed k.
We conclude that ĉ_k2 = (1/n) Σ_{j=1}^n H_k(X_j)ε_j converges in probability to 0 as n → ∞.