On the Exact Two-Sided Tolerance Intervals for Univariate Normal Distribution and Linear Regression

Statistical tolerance intervals are another tool for making statistical inference on an unknown population. The tolerance interval is an interval estimator based on the results of a calibration experiment, which can be asserted with stated confidence level 1 − α, for example 0.95, to contain at least a specified proportion 1 − γ, for example 0.99, of the items in the population under consideration. Typically, the limits of the tolerance intervals functionally depend on the tolerance factors. In contrast to other statistical intervals commonly used for statistical inference, the tolerance intervals are used relatively rarely. One reason is that the theoretical concept and computational complexity of the tolerance intervals is significantly more difficult than that of the standard confidence and prediction intervals. In this paper we present a brief overview of the theoretical background and approaches for computing the tolerance factors based on samples from one or several univariate normal (Gaussian) populations, as well as the tolerance factors for the non-simultaneous and simultaneous two-sided tolerance intervals for univariate linear regression. Such tolerance intervals are well motivated by their applicability in the multiple-use calibration problem and in construction of the calibration confidence intervals. For illustration, we present examples of computing selected tolerance factors by the implemented algorithm in MATLAB.


Introduction
Statistical tolerance intervals are interval estimators used for making statistical inference on population(s), which can be fully described by a probability distribution from a given family of distributions (as e.g., the family of normal distributions).For more details on different types of statistical intervals consult, e.g., the following books: Hahn and Meeker (1991), Krishnamoorthy and Mathew (2009), and Liu (2011).
Although the concept of statistical tolerance intervals has been well recognized for a long time, surprisingly, it seems that their applications remain still limited.The reliable algorithms for computing the exact tolerance factors are missing in the commonly used statistical packages (even for inferences on normal populations), however, more or less accurate approximations are available.Implementations of such algorithms (mainly based on approximate and/or Monte Carlo methods) are currently under fast development, as, e.g., in the package tolerance for R, see Young (2010).Thus, possible applications should rely either on implemented approximate methods or on published collections of tables for tolerance factors (if available), see e.g. the book Odeh and Owen (1980), which gives many of the most required factors in the context of the normal distribution, however, with limited precision.Due to the recognized importance of statistical tolerance intervals in technical applications, ISO (the International Organization for Standardization) has currently prepared a revised version of the ISO standard 16269-6 (Statistical interpretation of data -Part 6: Determination of statistical tolerance intervals), which also provides detailed tables of tolerance factors for selected tolerance intervals.
The theory of statistical tolerance intervals, as well as the computational methods and algorithms, have been developed significantly during the last three decades.This, together with the fast growing computational power of the personal computers, allows development of fast, efficient and reliable implementations of the algorithms for highly precise computing of the exact tolerance factors and limits required for the statistical tolerance intervals.For a comprehensive overview of the recent advances and developments in this area see Krishnamoorthy and Mathew (2009).
In this paper we shall briefly overview the theoretical background and describe some computational approaches for computing the exact tolerance factors for two-sided statistical tolerance intervals based on sample(s) from normal (Gaussian) population(s).Moreover, we shall also present a method for computing the exact simultaneous two-sided tolerance intervals for normal linear regression by using the method for computing the simultaneous tolerance factors for several independent univariate normal populations.
Based on that, we have developed a MATLAB algorithm for efficient and highly precise computation of the exact tolerance factors for the non-simultaneous as well as simultaneous two-sided tolerance intervals for several independent univariate normal populations.This can be used also for computing the exact tolerance factors for the non-simultaneous two-sided tolerance intervals, and also (in combination with other optimization procedures, based on Monte Carlo simulations) for computing the exact simultaneous two-sided tolerance intervals for univariate normal linear regression.
The methods and algorithms can be further used in the multiple-use calibration problem for constructing the appropriate simultaneous interval estimators (calibration confidence intervals) for values of the variable of primary interest, say x, based on possibly unlimited sequence of future observations of the response variable, say y, and on the results of the given calibration experiment, which was modeled/fitted by a linear regression model.Such calibration intervals can be obtained by inverting the simultaneous tolerance intervals constructed for the regression (calibration) function.For more details see, e.g., Scheffé (1973), Mee, Eberhardt, and Reeve (1991), Mee and Eberhardt (1996), Mathew andZha (1997), andChvosteková (2013b).

Two-sided tolerance intervals for univariate normal distribution
First, let us consider a simple calibration experiment, say E, which is represented by a random sample of size n from a population whose distribution is characterized by a univariate normal distribution N (µ, σ 2 ), i.e.Y 1 , . . ., Y n , where Y i are independent random variables normally distributed, Y i ∼ N (µ, σ 2 ), where µ and σ 2 are unknown parameters (mean and variance) of the population distribution.
Notice that the available information on distribution of the unknown population, based on the result of experiment E, is fully characterized by the random sample, or equivalently by the sufficient statistics: the sample mean, Ȳ = 1 n n i=1 Y i , and the sample variance, S 2 = Under given assumptions, it is well known that the sufficient statistics are independent random variables and their distribution is given by Ȳ ∼ N (µ, δ 2 σ 2 ), where δ 2 = 1 n , and S 2 ∼ σ 2 1 ν χ 2 ν , where ν = n − 1 denotes the degrees of freedom (DFs) and χ 2 ν represents a chi-square distribution with ν DFs.
2.1.Two-sided tolerance intervals for one univariate normal distribution Given the result of the calibration experiment E, we wish to construct a two-sided tolerance interval (i.e., a random interval (L E , U E ), with its limits depending on the result of the experiment E), which can be asserted with confidence level 1 − α (for example 0.95) to contain at least a specified proportion 1 − γ (for example 0.99) of the items in the population under consideration.
That is, we wish to construct the two-sided (1 − γ, 1 − α)-tolerance interval which will cover a pre-specified proportion of possibly infinite sequence of independent future realizations of the response variable Y = µ + σ (with ∼ N (0, 1) assumed to be independent of the calibration experiment E) such that Notice that the confidence level 1 − α is related to the random nature of the outcome (result) of the calibration experiment E. That is, the required two-sided tolerance interval will cover more than (1 − γ) × 100% proportion of the items of the unknown (normal) population, and this will be true in (1 − α) × 100% cases of the hypothetical calibration experiments.
In general, there are potentially many possible approaches to finding a solution to the problem as specified by (1).There is no unique solution until the form of the tolerance limits of the two-sided tolerance interval (L E , U E ) is reasonably restricted.Commonly, the tolerance limits are considered in the form where κ denotes the tolerance factor (a subject of the required solution) which depend on the stated coverage and confidence probabilities (1 − γ and 1 − α, respectively), and further on the parameters characterizing the design of the experiment, δ 2 and ν.So, if necessary, we can emphasize the dependence of the tolerance factor κ on other parameters by writing either κ(1 − γ, 1 − α, δ 2 , ν), or κ(δ 2 , ν), etc.
Consequently, the following conditional probability statement (conditional for given result of E) should be fulfilled for (1 − α) × 100% of the possible results of the calibration experiment (i.e., Ȳ and S 2 ) where The content function C(κ | Z, Q) cannot be evaluated directly for given κ and the observed result of the experiment E ( Ȳ , and S 2 ), as it depends on the unknown parameters µ and σ 2 .However, if we are interested in the stochastic properties of the tolerance intervals based on a large number of results of the hypothetical calibration experiments (i.e., the variability of the results of independent calibration experiments is to be considered), then Z and Q are independent pivotal random variables with known probability distributions independent of the unknown parameters µ and σ 2 , i.e.Z ∼ N (0, 1) and Q ∼ χ 2 ν .So, the content function C(κ; Z, Q), now considered as a random variable (a function of random variables Z and Q), can be used directly for checking the stochastic properties (the true confidence level) of the tolerance intervals, for any candidate value of the tolerance factor κ.
In particular, the tolerance factor κ is exact for the where I(•) is an indicator function, with I(true) = 1 and I(false) = 0, and E {Z,Q} (•) denotes the expectation operator with respect to the distribution of the random variables Z and Q.
Consequently, by applying a suitable iterative optimization procedure, C(κ; Z, Q) can be used for computing the exact value of the tolerance factor κ, such that it fulfills the required property given by (1), or (4), respectively.This may be realized either by using (repeated) Monte Carlo simulations, or two-dimensional numerical integrations.
The below presented formula for computing the exact tolerance factor κ of the two-sided (1 − γ, 1 − α)-tolerance intervals for a univariate normal distribution requires (repeated) evaluation of one-dimensional integral, only.As we shall discuss in more details in the next Sections, the approach can be generalized also for computing the tolerance factor for other models based on normal distribution (as, e.g., the non-simultaneous, point-wise tolerance intervals, as well as the simultaneous tolerance intervals for normal linear regression models), however, with possibly needed evaluation of multivariate integrals.
Notice that for a fixed δ and Z the function Φ(δ|Z| + r) − Φ(δ|Z| − r) is an increasing function of r.Let us denote by r 1−γ the solution to the equation 5), the problem can be rewritten equivalently as where ∼ N (0, 1).For fixed Z, the random variable ( − δ|Z|) 2 ∼ χ 2 1 (δ 2 Z 2 ), i.e. it has a noncentral chi-square distribution with one degree of freedom and the noncentrality parameter ).Thus, the tolerance factor κ defined by (1) and ( 2) is given implicitly as a solution to the equation where E {|Z|} (f (|Z|)) denotes the expectation of the function f (|Z|), with respect to the distribution of |Z|, where Z ∼ N (0, 1), F χ 2 ν (•) denotes the CDF of a chi-square distribution with ν degrees of freedom, Γ(•, •) is the incomplete regularized upper gamma function, and φ(z) denotes the PDF (probability density function) of a standard normal distribution.From computational point of view, the value χ 2 1;1−γ (δ 2 z 2 ) = r 2 1−γ can be computed more efficiently by directly solving the equation ( 5), i.e.Φ(δz + r 1−γ ) − Φ(δz − r 1−γ ) = 1 − γ, than by using a dedicated algorithm for computing quantiles of the non-central chi-square distribution.

Two-sided tolerance intervals for several independent univariate normal distributions with common variance
Here we consider a calibration experiment E = {E 1 , . . ., E m } which is based on m + 1 sufficient statistics, Ȳ1 , . . ., Ȳm and S 2 , where Ȳi = 1 , where n i is the sample size of the ith population.
We wish to construct a set of simultaneous two-sided tolerance intervals (L E,i , U E,i ), with limits , such that where Y i ∼ N (µ i , σ 2 ) are mutually independent random variables, independent from the calibration experiment E = {E 1 , . . ., E m }.
For a given (candidate) set of the tolerance factors, say κ 1 , . . ., κ m , the content function for the simultaneous tolerance intervals (L E,i , U E,i ) is given by where The set of tolerance factors κ 1 , . . ., κ m is exact for the simultaneous This may be checked either by a Monte Carlo simulation, or by (m+1)-dimensional numerical integration.
Notice, however, that the solution to the equation ( 10) is not unique, until further restrictions are imposed on the set of possible tolerance factors κ 1 , . . ., κ m .Frequently, it is required to have a common tolerance factor κ for all simultaneous tolerance intervals, i.e.
Under such restriction, the formula (7) can be generalized for computing the exact common tolerance factor κ of the simultaneous tolerance intervals, with limits , such that (8) holds true.However, a relatively simple generalization is possible only under further restrictive assumption that the calibration experiment E = {E 1 , . . ., E m } is based on m independent samples with common sample size n for each population N (µ i , σ 2 ), i.e. with ν = m(n − 1).In particular, under this restriction we get the content function as Φ(a + r) − Φ(a − r) is a decreasing function of |a|.
Then, using the analogy of ( 7), the generalized formula is derived by considering the distribution of the random variable Z m max = max(|Z 1 |, . . ., |Z m |) (where Z i ∼ N (0, 1), i = 1, . . ., m, are independent random variables) instead of |Z| (where Z ∼ N (0, 1)).In summary, the exact (simultaneous) tolerance factor κ can be computed as a solution to the equation where δ 2 = 1 n , ν = m(n − 1), and χ 2 1;1−γ (δ 2 z 2 ) denotes the (1 − γ)-quantile of the non-central chi-square distribution with 1 degree of freedom and the non-centrality parameter For m = 1, the tolerance factor given by the solution to the equation ( 12) is equivalent to the factor given by the solution to the equation ( 7) with ν = m(n − 1).Application of such a tolerance factor leads to the non-simultaneous tolerance intervals with limits for the considered m populations, each fulfilling the property as defined by ( 1), but formally different from the individual tolerance intervals defined by ( 2

One-sided tolerance intervals
For completeness (however, without more details), we note that the tolerance factor for the one-sided ) can be computed as a quantile of the non-central t-distribution.In particular, the exact tolerance factor for the (non-simultaneous) upper tolerance limit where with δ 2 = 1 n and z 1−γ being the (1 − γ)-quantile of the standard normal distribution, and t ν,∆;1−α denotes the (1−α)-quantile of the non-central t-distribution with ν degrees of freedom and the noncentrality parameter ∆.For more details see Krishnamoorthy and Mathew (2009), equations (1.2.2) and (2.2.3).
We notice an interesting (technical) relationship of the right hand side expression of the equation ( 7) to the CDF of the noncentral t-distribution with ν degrees of freedom and the noncentrality parameter ∆, say F t ν,∆ (•).In particular, This relationship allows to use similar computational strategies for computing the required tolerance factor κ, as for computing the CDF of the noncentral t-distribution.For more details see Witkovský (2013a).
Based on (14), and using the analogy of ( 12), the exact common tolerance factor κ for simultaneous one-sided upper tolerance limits U E i = Ȳi + κ √ S 2 (for m independent, equally sampled normal populations with possibly different means µ i , common variance σ 2 , and with common sample size n) can be computed as a solution to the equation where ν = m(n − 1), δ 2 = 1 n , and z 1−γ is the (1 − γ)-quantile of the standard normal distribution.For more details and alternative derivation see Krishnamoorthy and Mathew (2009), equation (2.5.3).

Two-sided tolerance intervals for univ. normal linear regression
Here we shall assume that the calibration experiment E is modeled by the linear regression model Y = Xβ + ε, where Y is an n-dimensional random vector of responses measured for n values x i , i = 1, . . ., n, of the explanatory variable x ∈ X ⊆ R r .However, here we shall assume that the explanatory variable is one-dimensional, i.e. that x ∈ (x min , x max ) ⊆ R, what is a typical situation for the frequently used p-order polynomial regression models.
The matrix X represents the (n × q)-dimensional calibration experiment design matrix with rows f (x i ) , for i = 1, . . ., n, i.e. q-dimensional functions of r-dimensional vectors x i .For example, in simple p-order polynomial linear regression model we get q = p + 1 and f (x i ) = (1, x i , x 2 i , . . ., x p i ) for x i ∈ X = (x min , x max ).For simplicity, here we shall assume that X is a full-ranked matrix.Further, β is the q-dimensional vector of regression coefficients and ε is an n-dimensional vector of measurement errors with its assumed distribution ε ∼ N (0, σ 2 I n ).Based on the calibration experiment E, we get the sufficient statistics and mutually independent pivot variables where ν = n − q.

Non-simultaneous tolerance intervals
The non-simultaneous tolerance intervals for the possible future realizations of the response variable Y (x) = f (x) β + σ (where f (x) is a known q-dimensional model function of x ∈ X and ∼ N (0, 1) is independent of the calibration experiment E), say (L E,x , U E,x ), are such that Similarly as in the univariate distribution case, the limits of the two-sided tolerance intervals for linear regression, (L E,x , U E,x ) for x ∈ X , are typically restricted to the form where by κ x we denote the required tolerance factor at x ∈ X .
Then, for the given candidate of the tolerance factor, say κ x , the content function for the non-simultaneous tolerance interval (L E,x , U E,x ) is where By comparing (3) and (20), it is clear that the exact tolerance factor κ x for the two-sided non-simultaneous tolerance interval (L E,x , U E,x ), evaluated at x ∈ X , can be computed by solving the equation ( 7), with ν = n − q and δ 2 = δ 2 x = f (x) (X X) −1 f (x).Notice that the value of the exact tolerance factor κ x does not depend directly on the vector x ∈ X and the model design matrix X.In fact, it depends on the model design only through ν = n − q and δ 2 x .That is, κ x is equal for all x ∈ X , such that δ 2 x = f (x) (X X) −1 f (x) is equal.This allows creation of tables and/or efficient interpolation-based approximations for computing the exact non-simultaneous tolerance factors κ x for the univariate normal linear regression models.

Simultaneous tolerance intervals
The simultaneous two-sided tolerance intervals for a possibly infinite sequence of the future realizations of the response variable Y (x) = f (x) β + σ , say (L E (x), U E (x)) for any x ∈ X , are such that Similarly as before, we consider the limits of the tolerance intervals to be restricted to the form where by κ(x) we denote the tolerance factor function defined for all x ∈ X .
Then, for a given candidate of the tolerance factor function, say κ(x), the content function for the simultaneous tolerance intervals (L E (x), U E (x)) is given by where Notice that the content function ( 23) depends on the design matrix X, in particular through the matrix (X X) −1 .
The tolerance factor function κ(x) is exact for the simultaneous This may be checked either by a Monte Carlo simulation, or by (q + 1)-dimensional numerical integration.In general, evaluation of ( 23) and/or (24) is a computationally demanding task, as it requires minimum search over x ∈ X for each evaluation at Z X , Q.
The solution to the equation ( 24) is not unique, until further restrictions are imposed on the form of the tolerance factor function κ(x).In accordance with Witkovský (2013b), here we suggest considering the family of the candidate tolerance factor functions κ(x), parametrized by the scalar parameter m ≥ 1, of the form where the function κ δ 2 (x), ν, m is given implicitly, for each x ∈ X , as a solution to the equation ( 12), by setting and with m = m.
Here, the parameter m (the simultaneousity parameter to be determined) represents the complexity of the regression function f (x) β over the considered range x ∈ X .The optimum value of m depends on the model and the design of the calibration experiment E: the model function (e.g., the polynomial of order p), the considered set X , the design matrix X, and the degrees of freedom ν.For example, in simple linear regression (polynomial of the order p = 1) the value m = 2 is a good starting point for the numerical (iterative) search procedure (i.e., the complexity of the simple linear regression function for all x ∈ X is assumed to be similar to the complexity of two independent normal populations).
Another possibility, suggested in Mee et al. (1991), is to consider the family of functions κ(x) = κ(δ(x)), linear functions of δ(x) = f (x) (X X) −1 f (x), parametrized by the scalar parameter λ > 0 (a parameter to be determined).In particular, where z 1− γ 2 is the (1 − γ 2 )-quantile of the standard normal distribution.Based on that, Mee et al. (1991) derived their optimum tolerance function κ(δ(x)) (however, not exact) as a solution to the equation by using the approximate content function where the range of δ(x) is considered for x ∈ X , and W ∼ χ 2 q is independent of Q ∼ χ 2 ν .Notice that the content function ( 28) does not depend directly on the design matrix X.

Multiple-use calibration problem
A motivation for computing tolerance intervals for the univariate normal linear regression is the multiple-use calibration problem and the associated problem of computing the calibration confidence intervals.
In many experimental sciences, acquisition of the measurement results frequently requires measurement procedures involving instrument calibration which can be modeled as a linear (polynomial) regression problem.Then, the required measurement result, say x * , is obtained through measuring the observable response variable, say Y * = Y (x * ) = f (x * ) β + σ , and by inverting the fitted regression (calibration) function.A problem of constructing and computing the appropriate confidence intervals for the unobservable values of the explanatory variable x * , based on a given fitted calibration function (a result of the calibration experiment), for a possibly unlimited sequence of future observations of the response variable Y * , is known as the multiple-use calibration problem.
In particular, for given observation Y * = Y (x * ), we shall construct the calibration confidence interval for the unobservable value of the explanatory variable, say x * ∈ X , by inverting the simultaneous tolerance intervals.So, the calibration confidence interval for x * is given by the random set The set ( 29) is not necessarily an interval.However, for most practical situations where the calibration function is (significantly) strictly monotonic for x ∈ X , the confidence set (29) typically results in an interval.Based on ( 21) and ( 29), we can directly characterize the stochastic properties of the calibration confidence intervals: We notice, however, that from the practical point of view, such calibration confidence intervals are considered to be too conservative, and consequently, as suggested in Mee and Eberhardt (1996), usage of the non-simultaneous two-sided tolerance intervals (L E,x , U E,x ) is recommended in (29), instead of using the exact simultaneous two-sided tolerance intervals (L E (x), U E (x)).Example 5.The algorithm can be used directly for computing the exact tolerance factors of the non-simultaneous two-sided tolerance intervals for normal linear regression models, and also, by using further optimization (used for finding the optimum value of m) for computing the exact tolerance factors of the simultaneous two-sided tolerance intervals.
For illustration, let us consider a calibration experiment for simple linear regression: Y = Xβ + ε, where X is an (n × 2) design matrix with n = 20.The first column of X, representing the intercept, is a column of ones, the second column has two distinct elements: −1 for the first 10 rows and 1 for the last 10 rows.So, (X X) −1 is a diagonal matrix with both diagonal elements equal to 1 n = 0.05, and consequently, δ(x) = (1, x)(X X) −1 (1, x) = 1 n (1 + x 2 ).We wish to compute the tolerance factors for the two-sided tolerance intervals with x ∈ X = (−2, 2), i.e. for δ(x) ∈ (δ min , δ max ) = 1 n , 1 n (1 + 2 2 ) = (0.2236, 0.5). Figure 1 plots the values of the exact non-simultaneous, the exact simultaneous and the approximate tolerance factors, calculated for 15 equidistant values of δ(x) ∈ (δ min , δ max ).The exact non-simultaneous tolerance factors were calculated by ( 12), with n = 20, ν = n−q = 18, and m = 1.The exact simultaneous tolerance factors were calculated according to ( 24) and ( 25) with m = 4.3, found by Monte Carlo based optimization, and ν = n − q = 18.The approximate tolerance factors were calculated by ( 26), with the optimum value of the parameter λ = 1.2057 taken from Table 2 in Mee et al. (1991), for n = 20 and τ = 2.Here is the MATLAB code used for computing the tolerance factors: and Φ(•) is the cumulative distribution function (CDF) of the standard normal distribution.So, C(κ | Z, Q) represents the proportion of the population covered by the tolerance interval for the given tolerance factor κ and for the given result of the calibration experiment E. C(κ | Z, Q) is commonly known as a content function.