A study of convolution models for background correction of BeadArrays

The RMA method, since its introduction in \cite{Iri03a, Iri03b, Iri06}, has gained popularity among bioinformaticians. It has evolved from the exponential-normal to the gamma-normal convolution, from single to two channels, and from the Affymetrix to the Illumina platform. The Illumina design provides two probe types, the regular and the control probes. This design is very suitable for studying the probability distribution of both, and one can apply the convolution model to compute the true intensity estimator. The availability of a benchmarking data set for the Illumina platform, the {\it Illumina spike-in}, helps researchers evaluate methods proposed for Illumina BeadArrays. In this paper, we study the existing convolution models for background correction of Illumina BeadArrays and give a new estimator for the true intensity, where the intensity value is exponentially or gamma distributed and the noise has a lognormal distribution. We compare the performance of the models on the Illumina spike-in data set based on various criteria, for example the root mean square error, the $L_{1}$ error, the Kullback-Leibler coefficient, and some criteria adapted from Affycomp \cite{Cop04}. We then provide a simulation study to measure the consistency of the background correction error and of the parametrization. We also study the performance of all models on an FFPE data set. Our study shows that our GLNn model is optimal for the benchmarking data set under the benchmarking criteria, while the gamma-normal model performs best for the benchmarking data set under the simulation criteria. On the public FFPE data set, the gamma-normal and exponential-gamma models with MLE cannot be used, and our proposed ELN and GLN models have the best performance, showing a moderate error in the background correction and in the parametrization.


Introduction
There are various processes in producing data from microarray experiments, and each process contributes noise to the data. The noise can be of two types, biological and non-biological. Non-biological noise should be avoided or at least minimized.
Sources of non-biological noise are, for example, the chip itself, the scanner, or fluctuations in the electric network. Therefore, the data needs to be adjusted. The pre-processing adjusts the intensity values ([13,14]) and provides the estimation of the true intensity.
Irizarry et al. [15][16][17] and Bolstad et al. [3], on the Affymetrix platform, have estimated the true intensity values based on the convolution model in the background correction step of their robust multi-array average (RMA) pre-processing method.
They assumed that the true intensity S is exponentially distributed and the background noise B is normally distributed.
Plancade et al. [20,21] showed that the RMA model (in [3] and [15][16][17]) does not fit Illumina BeadArrays: using the exponential-normal convolution leads to a large distance between the observed and the modeled intensities. They proposed instead a gamma distribution for the intensity value and a normal distribution for the noise.
The simulation study of Plancade et al. [20,21] showed that the gamma-normal model performs better than the existing exponential-normal convolution model, giving a more accurate and correct fit for the observed intensities in Illumina BeadArray.
Using a gamma distribution for the intensity values in Illumina BeadArrays was first suggested by Xie et al. [30].
The studies of Baek et al. [2] (in the background correction of the image processing) and Chen et al. [4] show that the noise distribution is usually skewed in different degrees. In their studies, based on simulated and real data sets, Baek et al. [2] conclude that the gamma distribution is well suited for the noise. It accounts for the intensities with a positive lower bound and is very flexible in its shape, including asymmetric exponential type and symmetric normal type.
The proposed convolution of exponential-gamma distribution by Chen et al. [4] improves the intensity estimation and the detection of differentially expressed genes in the case when the intensity to noise ratio is large and the noise has a skewed distribution.
In view of the remarks above, it is natural to model both the true intensity and the background noise in Illumina BeadArray as gamma distributed. In an earlier version of this paper we have developed an estimator for the true intensity based on the gamma-gamma convolution model of RMA. However, this model does not fit very well the Illumina benchmarking data set. Independently, Triche et al. [26] proposed and applied the gamma-gamma model to pre-process Illumina methylation arrays.
In this paper we introduce a new model for background correction in Illumina BeadArrays where the true intensity value is exponentially or gamma distributed and the noise has lognormal distribution. As we will see, this model avoids the difficulties with the gamma-gamma model and has an overall satisfactory performance.
We note that a new method reducing the bias of the maximum likelihood estimator of the shape parameter of the gamma distribution was proposed by Zhang [31]. But since our samples are very large, bias is not a problem in our studies.
Our paper is organized as follows. In Section 2 we describe the Illumina BeadArray technology, and in Section 3 we review previous work related to background correction for Illumina BeadArrays. Our model is described in Section 4. Sections 5 and 6 explain the benchmarking and the simulation studies, and Section 7 provides a performance comparison. Finally, Section 8 gives the conclusions and indications of future work.

Illumina BeadArrays technology
Illumina technology is one of the most advanced technologies for analyzing gene expression by microarrays. It can be used to profile partially degraded ribonucleic acid (RNA), which is usually found in FFPE samples, by the cDNA-mediated Annealing, Selection, Extension, and Ligation (DASL) assay method.
The huge amount of available FFPE data makes the Illumina platform very important, because the DASL assay method can deal with partially degraded RNA to profile gene expression in these samples.
The Illumina platform has small feature sizes, dense features and the ability to analyze multiple samples in parallel. Illumina provides two formats of microarrays ([8], [9], [19] and [25]), the Sentrix® Array Matrix (SAM) and the Sentrix BeadChip (SBC). The Array Matrix arranges fiber optic bundles, each containing 50,000 fibers spaced 5 µm apart, into an Array of Arrays™ format of a 96-well microtiter plate. On one end of the fiber optic bundles, the core of each fiber is etched to form a nanowell for the 3-µm silica beads.

In the BeadChip format, one to several microarrays are arranged on silicon slides that have been processed by micro-electromechanical systems (MEMS) technology to also have nanowells that support the self-assembly of beads. Steemers and Gunderson [25] explain the three parts of Illumina array manufacturing (see Figure 2.4). The three parts are:
1. The first part is the creation of a master bead pool consisting of 1,536-250,000 different bead types. For quality control, it includes the negative control beads. Oligonucleotide capture probes are immobilized individually by bead type in a bulk process. Each bead type in an array comes from a single immobilization event, reducing array-to-array feature variability. The design of the Illumina bead can be seen in Figure 2.3.
2. The second part is the random self-assembly of the master pool of bead types into etched wells on the array substrate, where each bead type has on average a thirty-fold representation, a strategy that provides the statistical accuracy of multiple measurements.
3. The third part is the identification of each bead on the array through a decoding process. This process provides information on each bead and performs a quality control of the features in every array.

Previous work
Affymetrix is the pioneer and most widely used platform for microarray gene expression experiments. The tools and algorithms to handle the data are numerous, both free and commercial. Some methods for pre-processing are available. Examples for the background correction step are: MAS5.0 by Affymetrix, multiplicative model based expression index (MMBE) by Li and Wong [18], RMA in Irizarry et al. [15][16][17] and Bolstad et al. [3], GC-RMA by Wu et al. [29] and maximum likelihood estimation based on the normal-exponential convolution model by Silver et al. [24].
Illumina is one of the alternative platforms and is increasingly popular. A few statistical methods have been developed for BeadArray data, and there is no consensus yet on the pre-processing steps [23]. Xie et al. [30] mention that for the background correction step, Illumina BeadStudio gives two options (no background correction and background subtraction), and the packages for BeadArrays in R provide three options (no background correction, background subtraction and RMA background correction).
Ding et al. [6], Chen et al. [4] and Plancade et al. [20,21] built their background correction models by adapting the RMA Affymetrix model. As Forcheh et al. [10] pointed out, most pre-processing methods for Illumina BeadArrays are ported from the Affymetrix microarray platform.

Background correction by RMA
In modeling the intensity values, RMA ([3] and [15][16][17]) assumes that the intensity values are affected by the noise of the chip. The RMA model is
$$P = S + B, \qquad (3.1)$$
where $P = PM$ is the observed probe-level intensity of the perfect match probes, $S$ is the true signal, with $S \sim f_{1}(s;\theta) = Exp(\theta)$, $\theta > 0$, and $B$ is the background noise, with $B \sim N(\mu, \sigma^2)$. To avoid negative intensity values, we truncate $B$ at 0 from below; this will not change its density function essentially. Assuming independence, the joint density of the two-dimensional random variable $(S, B)$ is the product of the two marginal densities. Furthermore, the transformation formula for two-dimensional densities gives that the joint density of $S$ and $P$ is
$$f(s, p) = \theta e^{-\theta s}\, \frac{1}{\sigma\sqrt{2\pi}} \exp\Big(-\frac{(p - s - \mu)^2}{2\sigma^2}\Big), \qquad 0 < s < p.$$
The background-adjusted intensity is computed as the estimated signal given the observed intensity, i.e. the conditional expectation $E(S \mid P = p)$. The substitution $s = \mu_{S,P} + \sigma t$, with $\mu_{S,P} = p - \mu - \theta\sigma^2$, yields the estimator [3]
$$\hat S(p) = E(S \mid P = p) = a + b\,\frac{\phi(a/b) - \phi((p - a)/b)}{\Phi(a/b) + \Phi((p - a)/b) - 1}, \qquad (3.6)$$
where $a = p - \mu - \theta\sigma^2$, $b = \sigma$, and $\phi$ and $\Phi$ denote the standard normal density and distribution function. Note that modelling the noise as a truncated normal variable has the consequence that the noise equals 0 with a positive probability $p_0$, a rather unpleasant feature of the model. As pointed out in [30], however, in practical cases $p_0$ is rather small, so this problem can be disregarded. To avoid this difficulty, one can model the noise as the absolute value of an $N(\mu, \sigma^2)$ variable, which changes the calculations above.
However, since in this paper we will provide a background correction model fitting reality considerably better, we do not give the details here.
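The RMA estimator can be sketched numerically. The function below is a direct transcription of the closed form (3.6) as reconstructed here; treat it as an illustration rather than the reference implementation, and the parameter values in the examples are made up.

```python
import numpy as np
from scipy.stats import norm

def rma_background_correct(p, theta, mu, sigma):
    """Closed-form RMA estimate E(S | P = p) for S ~ Exp(theta) and
    noise N(mu, sigma^2) truncated at 0, so that 0 < S < p."""
    a = p - mu - theta * sigma**2          # location of the conditional normal
    b = sigma
    num = norm.pdf(a / b) - norm.pdf((p - a) / b)
    den = norm.cdf(a / b) + norm.cdf((p - a) / b) - 1.0
    return a + b * num / den
```

For intensities well above the noise level, the estimate is close to $p - \mu - \theta\sigma^2$, i.e. the observed value minus the expected noise and an exponential-prior shrinkage term.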

Exponential-normal MBCB
Xie et al. [30] use the same underlying distributions as in Equation (3.1). The estimator of the true intensity value of Xie et al. [30] is
$$\hat S(p) = E(S \mid P = p) = a + b\,\frac{\phi(a/b)}{\Phi(a/b)}, \qquad (3.7)$$
with $a = p - \mu - \theta\sigma^2$ and $b = \sigma$. Under the convolution model (3.1), where the true intensity value is assumed exponentially distributed and the noise is normally distributed, we need to estimate the parameters $\theta$, $\mu$, and $\sigma^2$. Xie et al. [30] offer three parameter estimation methods: method of moments, maximum likelihood and Bayesian. RMA, on the other hand, applies an ad hoc method.
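Assuming the standard normal-exponential MBCB form of the conditional expectation (the conditional law of $S$ given $P = p$ is normal truncated to $S > 0$), a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def mbcb_correct(p, theta, mu, sigma):
    """MBCB-style estimate E(S | P = p) for S ~ Exp(theta), B ~ N(mu, sigma^2):
    the conditional law of S is N(a, sigma^2) truncated to S > 0."""
    a = p - mu - theta * sigma**2
    return a + sigma * norm.pdf(a / sigma) / norm.cdf(a / sigma)
```

The inverse-Mills-ratio term keeps the estimate strictly positive even when the observed intensity is below the mean noise level.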

Gamma-normal convolution
Plancade et al. [20,21] introduced gamma-normal convolution to model the background correction of Illumina BeadArray. The model is based on the RMA background correction of Affymetrix GeneChip. Plancade et al. [20,21] assume that the intensity value is gamma distributed and the noise is normally distributed.
Under the background correction model (3.1), $f_P$ is the convolution of $f_S$ and $f_B$.
The true intensity $S$ is estimated by the conditional expectation of $S$ given $P = p$:
$$\hat S(p) = E(S \mid P = p) = \frac{\int_0^{\infty} s\, f^{gam}_{\alpha,\beta}(s)\, f^{norm}_{\mu,\sigma}(p - s)\, ds}{\int_0^{\infty} f^{gam}_{\alpha,\beta}(s)\, f^{norm}_{\mu,\sigma}(p - s)\, ds}, \qquad (3.8)$$
where $f^{gam}_{\alpha,\beta}(s) = s^{\alpha-1} e^{-s/\beta} / (\Gamma(\alpha)\beta^{\alpha})$ is the gamma density. When $S$ is gamma distributed and $B$ is normally distributed, equation (3.8) has no analytic expression like (3.6). Plancade et al. [20,21] implemented the Fast Fourier Transform (FFT) to estimate the parameters. Moreover, equation (3.8) can be written as
$$\hat S(p) = \alpha\beta\, \frac{(f^{gam}_{\alpha+1,\beta} * f^{norm}_{\mu,\sigma})(p)}{(f^{gam}_{\alpha,\beta} * f^{norm}_{\mu,\sigma})(p)},$$
where $\Theta = (\mu, \sigma, \alpha, \beta)$, $S$ and $B$ are independent, and the identity $s f^{gam}_{\alpha,\beta}(s) = \alpha\beta f^{gam}_{\alpha+1,\beta}(s)$ is valid for every $s > 0$.
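Plancade et al. evaluate these convolutions with the FFT; for illustration, the sketch below evaluates the two convolutions by plain quadrature instead, using the identity $s f^{gam}_{\alpha,\beta}(s) = \alpha\beta f^{gam}_{\alpha+1,\beta}(s)$. The parameter values in the example are made up.

```python
import numpy as np
from scipy.stats import gamma, norm
from scipy.integrate import trapezoid

def gn_correct(p, alpha, beta, mu, sigma, n=20001):
    """Gamma-normal estimate: alpha*beta * (f_{alpha+1} conv phi)(p)
    divided by (f_alpha conv phi)(p), by numerical quadrature."""
    upper = max(p - mu + 8 * sigma, 8 * sigma)   # noise density negligible beyond this
    s = np.linspace(1e-12, upper, n)
    noise = norm.pdf(p - s, loc=mu, scale=sigma)
    den = trapezoid(gamma.pdf(s, alpha, scale=beta) * noise, s)
    num = alpha * beta * trapezoid(gamma.pdf(s, alpha + 1, scale=beta) * noise, s)
    return num / den
```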

Exponential-Gamma convolution
Chen et al. [4] proposed, under the convolution model of Equation (3.1), the exponential and the gamma distribution for the true intensity and the noise, respectively. Therefore $S \sim Exp(\theta)$ and $B \sim Gamma(\alpha, \beta)$, and the background-corrected intensity for the proposed model ([4]) is
$$\hat S(p) = E(S \mid P = p) = \frac{\int_0^{p} s\, \theta e^{-\theta s}\, f^{gam}_{\alpha,\beta}(p - s)\, ds}{\int_0^{p} \theta e^{-\theta s}\, f^{gam}_{\alpha,\beta}(p - s)\, ds}. \qquad (3.10)$$
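Since (3.10) is a ratio of two one-dimensional integrals over $(0, p)$, it can be sketched by direct quadrature; the parameter values in the example below are made up.

```python
import numpy as np
from scipy.stats import gamma
from scipy.integrate import trapezoid

def eg_correct(p, theta, alpha, beta, n=20001):
    """Exponential-gamma estimate E(S | P = p): S ~ Exp(theta),
    B ~ Gamma(alpha, beta); both are positive, hence 0 < S < p."""
    s = np.linspace(0.0, p, n)
    w = theta * np.exp(-theta * s) * gamma.pdf(p - s, alpha, scale=beta)
    return trapezoid(s * w, s) / trapezoid(w, s)
```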

Estimation of the true intensity value
Consider the model (3.1), when the true intensity S is exponentially distributed, i.e.
$S \sim f_{1}(s;\theta) = \theta e^{-\theta s}$, $\theta, s > 0$, and the background noise $B$ is lognormally distributed,
$$f_{2}(b; \mu, \sigma) = \frac{1}{b\sigma\sqrt{2\pi}} \exp\Big(-\frac{(\ln b - \mu)^2}{2\sigma^2}\Big), \qquad b > 0.$$
The joint density function of $S$ and $B$ equals $f_{1}(s;\theta) f_{2}(b;\mu,\sigma)$, and thus the joint density function of $S$ and $P$ is
$$f(s, p) = \theta e^{-\theta s}\, \frac{1}{(p - s)\sigma\sqrt{2\pi}} \exp\Big(-\frac{(\ln(p - s) - \mu)^2}{2\sigma^2}\Big), \qquad 0 < s < p.$$
Consequently, the marginal density function of $P$ equals $f_P(p) = \int_0^p f(s, p)\, ds$. Using the substitution $\ln(p - s) = z$ (so $s = p - e^{z}$ and $ds = -e^{z}\, dz$) we can evaluate the last integral as follows:
$$f_P(p) = \frac{\theta e^{-\theta p}}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\ln p} \exp\Big(\theta e^{z} - \frac{(z - \mu)^2}{2\sigma^2}\Big)\, dz. \qquad (4.1)$$
The conditional density function of $S$ under $P = p$ is now obtained as
$$f(s \mid p) = \frac{f(s, p)}{f_P(p)}, \qquad 0 < s < p. \qquad (4.2)$$
The true intensity value is estimated by taking the expectation of the conditional density function in (4.2). Using the substitution $\ln(p - s) = z$ again, we obtain
$$\hat S(p) = E(S \mid P = p) = p - \frac{\int_{-\infty}^{\ln p} e^{z} \exp\big(\theta e^{z} - (z - \mu)^2/(2\sigma^2)\big)\, dz}{\int_{-\infty}^{\ln p} \exp\big(\theta e^{z} - (z - \mu)^2/(2\sigma^2)\big)\, dz}. \qquad (4.3)$$
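The two integrals in the ELN estimator (4.3) have no closed form but are one-dimensional and well behaved, so a simple quadrature sketch suffices; the parameter values in the example are illustrative.

```python
import numpy as np
from scipy.integrate import trapezoid

def eln_correct(p, theta, mu, sigma, n=20001):
    """ELN estimate (4.3): after z = ln(p - s),
    S-hat(p) = p - int e^z w(z) dz / int w(z) dz over z < ln p,
    with w(z) = exp(theta e^z - (z - mu)^2 / (2 sigma^2))."""
    z = np.linspace(mu - 10 * sigma, np.log(p), n)  # weight negligible below mu - 10 sigma
    w = np.exp(theta * np.exp(z) - (z - mu)**2 / (2 * sigma**2))
    return p - trapezoid(np.exp(z) * w, z) / trapezoid(w, z)
```

The estimate is the observed intensity minus the conditional mean of the lognormal noise, tilted by the exponential prior on the signal.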

Parameter estimation
To estimate the parameters θ, µ, and σ in (4.3), we can use various methods.

Maximum likelihood (MLE)
This is implemented by applying the optim function to the likelihood function
$$L(\theta, \mu, \sigma) = \prod_{i=1}^{n} f_P(p_i; \theta, \mu, \sigma),$$
where $f_P$ is given in (4.1). The log-likelihood function is
$$\ell(\theta, \mu, \sigma) = \sum_{i=1}^{n} \ln f_P(p_i; \theta, \mu, \sigma).$$
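A sketch of the maximum likelihood fit, with scipy's `minimize` playing the role of R's `optim`; the marginal density is the numerically integrated $f_P$ from (4.1), and the simulated data, grid sizes and starting values are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.optimize import minimize

rng = np.random.default_rng(0)
true_theta, true_mu, true_sigma = 0.01, 3.7, 0.3
# simulated observed intensities: P = S + B, S ~ Exp(rate theta), B lognormal
p = rng.exponential(1 / true_theta, 2000) + rng.lognormal(true_mu, true_sigma, 2000)

def neg_loglik(par):
    theta, mu, sigma = par
    if theta <= 0 or sigma <= 0:
        return np.inf
    # f_P(p) = theta e^{-theta p} / (sigma sqrt(2 pi)) * I(p), where
    # I(p) = int_{-inf}^{ln p} exp(theta e^z - (z - mu)^2 / (2 sigma^2)) dz
    z = np.linspace(mu - 8 * sigma, np.log(p.max()), 400)
    with np.errstate(over='ignore'):
        w = np.exp(theta * np.exp(z)[None, :] - (z[None, :] - mu)**2 / (2 * sigma**2))
    w = np.where(z[None, :] < np.log(p)[:, None], w, 0.0)  # per-point upper limit ln p_i
    integral = trapezoid(w, z, axis=1)
    if not np.all(np.isfinite(integral)) or np.any(integral <= 0):
        return np.inf
    ll = np.log(theta) - theta * p - np.log(sigma * np.sqrt(2 * np.pi)) + np.log(integral)
    return -np.sum(ll)

res = minimize(neg_loglik, x0=[0.02, 3.0, 0.5], method='Nelder-Mead',
               options={'maxiter': 800})
```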

Method of moments
The method of moments is implemented as follows:
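The steps are not spelled out here; an assumed moment-matching scheme, with the noise parameters estimated from the negative control probes by lognormal moment matching and the signal mean by subtraction, would look as follows. The function name and the scheme itself are this sketch's assumptions.

```python
import numpy as np

def eln_moments(p_regular, b_control):
    """Assumed method-of-moments scheme for the ELN model:
    mu, sigma from the control probes (lognormal moment matching),
    theta from E S = E P - E B = 1/theta."""
    m_b, v_b = b_control.mean(), b_control.var()
    sigma2 = np.log(1.0 + v_b / m_b**2)   # lognormal: Var = (e^{sigma^2}-1) e^{2 mu + sigma^2}
    mu = np.log(m_b) - sigma2 / 2.0       # lognormal: E B = e^{mu + sigma^2/2}
    theta = 1.0 / (p_regular.mean() - m_b)
    return theta, mu, np.sqrt(sigma2)
```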

Estimation of the true intensity value
Consider now model (3.1), when the true intensity $S$ is assumed to be gamma distributed, $S \sim f^{gam}_{\alpha,\beta}(s) = s^{\alpha-1} e^{-s/\beta} / (\Gamma(\alpha)\beta^{\alpha})$, and the noise $B$ is lognormally distributed as before. The joint density function of $S$ and $P$ is
$$f(s, p) = \frac{s^{\alpha-1} e^{-s/\beta}}{\Gamma(\alpha)\beta^{\alpha}}\, \frac{1}{(p - s)\sigma\sqrt{2\pi}} \exp\Big(-\frac{(\ln(p - s) - \mu)^2}{2\sigma^2}\Big), \qquad 0 < s < p.$$
Hence the marginal density function of $P$ is obtained as
$$f_P(p) = \int_0^{p} f(s, p)\, ds. \qquad (4.7)$$
Using the substitution $\ln(p - s) = z$, we get
$$f_P(p) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}\sigma\sqrt{2\pi}} \int_{-\infty}^{\ln p} (p - e^{z})^{\alpha-1} e^{-(p - e^{z})/\beta} \exp\Big(-\frac{(z - \mu)^2}{2\sigma^2}\Big)\, dz. \qquad (4.8)$$
The conditional density function of $S$ under $P = p$ is now obtained as
$$f(s \mid p) = \frac{f(s, p)}{f_P(p)}, \qquad 0 < s < p. \qquad (4.9)$$
The true intensity value is estimated by taking the expectation of the conditional density function in (4.9). Substituting $\ln(p - s) = z$, the estimator becomes
$$\hat S(p) = E(S \mid P = p) = \frac{\int_{-\infty}^{\ln p} (p - e^{z})^{\alpha} e^{-(p - e^{z})/\beta} \exp\big(-(z - \mu)^2/(2\sigma^2)\big)\, dz}{\int_{-\infty}^{\ln p} (p - e^{z})^{\alpha-1} e^{-(p - e^{z})/\beta} \exp\big(-(z - \mu)^2/(2\sigma^2)\big)\, dz}, \qquad (4.10)$$
where the integrals have no closed form and are evaluated numerically.
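As with the ELN model, the GLN estimator (4.10) can be sketched by quadrature on the $z$ scale. The test parameter values are made up, and the integrand is singular at $z = \ln p$ when $\alpha < 1$, so this naive grid is only adequate for $\alpha > 1$.

```python
import numpy as np
from scipy.integrate import trapezoid

def gln_correct(p, alpha, beta, mu, sigma, n=20001):
    """GLN estimate (4.10) after z = ln(p - s), with s = p - e^z:
    ratio of int s^alpha e^{-s/beta} g(z) dz to
    int s^(alpha-1) e^{-s/beta} g(z) dz, g(z) = exp(-(z-mu)^2/(2 sigma^2))."""
    z = np.linspace(mu - 10 * sigma, np.log(p) - 1e-9, n)
    s = p - np.exp(z)
    base = s**(alpha - 1) * np.exp(-s / beta - (z - mu)**2 / (2 * sigma**2))
    return trapezoid(s * base, z) / trapezoid(base, z)
```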

Parameter estimation
To estimate the parameters α, β, µ, and σ in (4.10), we can use either of the following methods.

Maximum likelihood (MLE)
This is implemented by applying the optim function to the likelihood function
$$L(\alpha, \beta, \mu, \sigma) = \prod_{i=1}^{n} f_P(p_i; \alpha, \beta, \mu, \sigma),$$
where $f_P$ is given in (4.8). The log-likelihood function is
$$\ell(\alpha, \beta, \mu, \sigma) = \sum_{i=1}^{n} \ln f_P(p_i; \alpha, \beta, \mu, \sigma).$$

Method of moments
The method of moments is implemented as follows: (a) compute $\bar S = \bar P - \bar B$ and $S^2_S = S^2_P - S^2_B$; then $\alpha$ and $\beta$ are estimated by $\bar S/\hat\beta$ and $S^2_S/\bar S$, respectively.

There are approximately 48,000 probe sets for each sample, and in addition the 33 spike-in probes are added to them. For the control probes, there are 1,616 probes.
These control experiments are the benchmarking data sets of Illumina and are used to compare low-level analysis methods, as in the Affymetrix platform.

Benchmarking Criteria
We adopt the benchmarking criteria from Affymetrix platform, in the Affycomp package, and some criteria which have been used by [30], [4], and [20,21].
Affycomp [5] provides fourteen criteria, and here we define different ranges for each classification. Cope et al. [5] defined the low, medium and high intensities as the nominal concentrations less than or equal to 2 pM, between 4 and 32 pM, and greater than or equal to 64 pM, respectively.
For the Illumina spike-in data set, the low, medium and high concentrations are defined, respectively, as the nominal concentrations less than or equal to 1 pM, between 3 and 30 pM, and greater than or equal to 100 pM. Instead of the slope, we use $R^2$, because it reflects the fit of the data: it measures the percentage of the variation in the observed expressions that is explained by the nominal concentrations.
Criteria 1 to 9 below were computed by the author, and Criteria 10 to 14 were computed by implementing the assessSpikeIn2 and assessSpikeIn functions from the affycomp package, with some adjustments. The benchmarking criteria are as follows: 1. Median SD. The variance of an expression measure across replicate arrays should be low, so the standard deviation (SD) should be low too, ideally zero. The median of the standard deviations across replicate arrays is chosen as the measurement due to its robustness.
2. Null log-fc IQR. The non-spike-in genes should not be differentially expressed across arrays. Therefore, the inter-quartile range (IQR) of the log-fold-changes of the non-spike-in genes is, ideally, zero.

3. Null log-fc 99.9%. The 99.9th percentile of the log-fold-changes of the non-spike-in genes; ideally it is close to zero.
4. Signal detect $R^2$. The R-squared ($R^2$) is obtained from regressing the expression values on the nominal concentrations in the spike-in data. The ideal value of $R^2$ is 1, because ideally an increment in the nominal concentration will be followed by an increment of the expression values on the same scale.

5. Low.$R^2$. This is obtained from the regression of observed log expression on nominal log concentration for genes with low intensities.
6. Med.$R^2$. This is obtained from the regression of observed log expression on nominal log concentration for genes with medium intensities.
7. High.$R^2$. As above but for genes with high intensities.
8. Obs-intended-fc $R^2$. The $R^2$ obtained by regressing observed log-fold-changes against nominal log-fold-changes for the spike-in genes.
9. Obs-(low)int-fc $R^2$. The $R^2$ obtained by regressing observed log-fold-changes against nominal log-fold-changes for the spike-in genes with low intensities.
10. Low AUC. This is computed as the area under the receiver operating characteristic (ROC) curve (up to 100 false positives) for genes with low intensities, standardized so that the optimum value is 1.
11. Med AUC. As above but for genes with medium intensities.
12. High AUC. As above but for genes with high intensities.
13. Weighted avg AUC. A weighted average of the previous 3 ROC curves with weights related to amount of data in each classification (low, medium and high).

Affycomp Plot
Affycomp contains several plots that are used as supplementary support in benchmarking against the spike-in and dilution data sets. Some of them are described in [5] and at http://affycomp.biostat.jhsph.edu.

Observed fold-change versus nominal fold-change
The plot of observed versus nominal log fold-change is used to validate the differentially expressed genes.

Reproducibility
To assess which models best reproduce the benchmarking data, two measurements are applied (see e.g. Shamilov [22]), where $x$ is the true intensity estimator and $y$ is the observed intensity.
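One measurement named in this paper is the Kullback-Leibler coefficient. A histogram-based sketch of the KL divergence between the corrected intensities $x$ and the observed intensities $y$ follows; the binning choice is an assumption of this sketch.

```python
import numpy as np

def kl_divergence(x, y, bins=100):
    """Kullback-Leibler divergence D(x || y) between the empirical
    densities of two samples, estimated on a common histogram grid."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    px, edges = np.histogram(x, bins=bins, range=(lo, hi), density=True)
    py, _ = np.histogram(y, bins=edges, density=True)
    width = edges[1] - edges[0]
    mask = (px > 0) & (py > 0)   # skip empty bins to avoid log 0
    return np.sum(px[mask] * np.log(px[mask] / py[mask])) * width
```

The divergence is zero when the two samples have identical empirical distributions and grows as they separate.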

Simulation study
Let $N$ be the number of simulations, $n_1$ the sample size of the regular probes, $n_2$ the sample size of the negative probes, $\Theta$ the original parameter vector of the underlying distribution in each model from the data set, and $\hat\Theta$ the parameter vector of the underlying distribution in each model from the simulated data.
The lower these criteria, the better the model represents the data at hand.
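The simulation loop can be sketched as follows. The corrector here is a deliberately naive moment-subtraction one, and the parameter values, the plug-in re-estimation of $\theta$ and the raw (untransformed) scale are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, mu, sigma = 0.01, 3.7, 0.3        # illustrative ELN parameters
N, n1 = 20, 1000                         # simulation rounds, regular-probe sample size
mse_bc, l1 = [], []
for _ in range(N):
    s = rng.exponential(1 / theta, n1)   # true intensities
    p = s + rng.lognormal(mu, sigma, n1) # observed = signal + lognormal noise
    b_mean = np.exp(mu + sigma**2 / 2)   # known mean noise level
    s_hat = np.maximum(p - b_mean, 0.0)  # naive background correction
    mse_bc.append(np.mean((s_hat - s)**2))  # background-correction error
    theta_hat = 1.0 / np.mean(s_hat)     # re-estimated signal parameter
    l1.append(abs(theta_hat - theta))    # parametrization (L1) error
```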
We will call the methods above, respectively, ENr, ENm, ENn, ENmc, ELNm, ELNn, ELNp, EGm, GN, GLNm and GLNn.

Illumina Spike-in

First, we compute the adopted Affycomp benchmarking criteria, based on the data after background correction and their log transformation. Second, in the simulation, the $MSE_{bc}$ and the $L_1$ error will be computed based on the log transformation of the experiment and the nominal concentration data.
The log transformations that we use in this paper for the benchmarking and the FFPE data sets, respectively, are
$$y = \log_2\big(x + \sqrt{x^2 + 1}\big) \quad \text{and} \quad y = \log_2\big(x + 1 + \sqrt{(x+1)^2 + 1}\big), \qquad (7.1)$$
where $x$ is the concentration or the intensity value.
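Equation (7.1) is the familiar generalized-log ("started log") transform; assuming the two forms as reconstructed here, a direct transcription:

```python
import numpy as np

def glog2(x):
    """Benchmarking transform: log2(x + sqrt(x^2 + 1))."""
    return np.log2(x + np.sqrt(x**2 + 1.0))

def glog2_ffpe(x):
    """Assumed FFPE transform: the same glog with the intensity shifted by 1."""
    return glog2(x + 1.0)
```

For large $x$, glog2(x) is close to $\log_2(2x)$, while glog2(0) = 0, so the transform behaves like a logarithm but tolerates zeros.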

Non-simulation
Table 3 shows that the ENr provides the smallest variation and IQR, and the GLNn model provides the smallest 99.9% percentile of the log fold change for the non-spike-in genes between replicates. The largest variation, IQR and 99.9% percentile are produced by the GLNm, the ELNm and the GN, respectively. Table 4 shows that, in general, all methods perform similarly.
The GLNn model has the highest signal detect $R^2$. The GN model has the highest $R^2$ at low concentrations but the lowest $R^2$ at high concentrations, which means that the GN model works better at low concentrations. The ENr, on the other hand, works better at medium and high concentrations, followed closely by the GLNn model.
If we divide the concentrations into two categories, where high concentration means a nominal concentration of at least 3 pM and low concentration means at most 1 pM, the GLNn model has the highest $R^2$ (data not shown). In general, and at high concentrations, the GLNn thus offers the best fit among the models.
As in Table 4, Table 5 shows that all models have similar performance, although the GLNn model has the highest $R^2$ of nominal concentration against observed log-fold-change. Table 6 provides the results of the AUC computation. The table shows that all models have better accuracy at medium concentrations than at low and high concentrations. The ENr, followed by the GLNn, performs very poorly at low concentrations, where the GLNm performs best. At high concentrations, the ENr performs best, followed by the GLNn. In general, however, the highest AUC is achieved by the models with the MLE parameter estimation method: the GLNm, ELNm and ENm.
The computation, based on the 12 arrays and on all arrays, shows that all models have an AUC greater than 0.9. According to Zhu et al. [32], an AUC between 0.9 and 1.0 is classified as excellent accuracy. Therefore, based on Table 6, several models are excellently accurate in predicting the gene expression.
In the Appendix, we provide some supplementary materials based on the Affycomp criteria from [5] and [15,16], as explained in Section 5.

Simulation
We run simulations to assess the performance of each model. The bias of the background correction is assessed by the $MSE_{bc}$, and the bias of the parametrization is assessed by the $L_1$ error; these criteria have been defined in Section 6. The GLNn model does not have the best performance on the $MSE_{bc}$, but we can still consider its performance good, given that the bias of its parameters is similar to that of the other proposed models and the GN.
One of the proposed models, the GLNm, has the highest bias on the $MSE_{bc}$ and on the parameter $\alpha$. In our view this happens because we use an approximation in estimating the true intensity value. The EN models (ENr, ENm, ENn and ENmc) perform considerably better on the $MSE_{bc}$, but not on the parametrization: the bias in the parametrization of the noise is higher than in the other models.

FFPE data set
Based on the results from Section 7.1, we compare the performance of these models on the public data sets. We would like to know how good these models are on real data samples. Here, we choose to use the FFPE data set.
Currently, millions of FFPE archival samples are available, and they are a great source of information in medical studies of diseases such as cancer. This data type suffers from RNA degradation, which leads to poor performance in array-based studies. However, Illumina's DASL assay can provide high-quality data from these degraded RNA samples.
Comparing the performance of these background correction models certainly would help researchers to choose the appropriate background correction for their data, particularly if their data is the FFPE type.
The background correction for the FFPE data set is implemented in three steps, the first being quality control (QC) of the raw FFPE data; here we used the ffpe package in R ([27]). The results of our computation are in Tables 8 and 9. From these tables, we can see that there are no EG and GN models: neither of these models works on these data sets. For some samples in the data set, both models fail to compute the parameters, with the consequence that the true intensity value cannot be provided.
We decided to remove the EG and GN models from further comparisons in both FFPE data sets. Here we provide the results of the rest of the models only.
Tables 8 and 9 consistently show that the bias of the noise parameters in the EN models is higher than in the proposed models. For the parameter $\beta$, the ELNn has the lowest bias. The proposed model GLNm continues to show the highest bias in the background correction and in the parameter $\alpha$. As mentioned previously, this is a consequence of the approximative computation of the true intensity value, where we truncate the approximation at $k = 10$. There is a possibility to apply a different numerical approximation for both the ELN and GLN models.

Conclusions and indication of future work
We have studied the additive models of background correction for beadarrays and proposed some new models where the true intensity is assumed to have exponential or gamma distribution and the noise is lognormally distributed. We have derived the estimator of the true intensity value of the proposed models.
Further, we compared the performance of all models, based on the benchmarking and public data sets. In the benchmarking data set we adopted the criteria from the Affycomp [5] and for the simulation study we used the criteria which have been used in [30], [4] and [20,21]. For the public data sets, we only used the criteria for the simulation study.
We have seen in Sections 7.1.1 and 7.1.2 that the EN, EG, GN and GLN models perform rather similarly. From the Affycomp criteria we can make the following points: 1. the ENr and GLNn provide the lowest variation between replicates, and all models using the MLE estimation method have higher variation than the others; 2. the GLNn model has the highest signal detect $R^2$, in general and at high concentrations, which means the GLNn model fits the gene expression best;
3. the GLNn model, based on the MvA plot, produces the smallest number of genes which should not be expressed but are nevertheless expressed; the GN model, on the other hand, produces the largest number of such genes;
4. all models with the MLE estimation method have a higher average AUC value, which means that they provide a better accuracy in predicting the gene expression.
In the simulation study, the best performance in the background correction and parametrization error is achieved by the GN model, followed by our proposed ELN models. It has been shown that the GLNn does not perform optimally on the $MSE_{bc}$ criterion, but for the parametrization this model can still be considered good.
In the FFPE public data set, the GN and EG models cannot be implemented. This is in strong contrast with the fact that in the simulation study of benchmarking data set, the GN model has the best performance.
The EN models show the highest bias in the parametrization in both public data sets and the lowest bias in the background correction. Our proposed models, except the GLNm, show the lowest bias in the parametrization in both data sets and a moderate bias in the background correction.
Based on the results from the benchmarking data and the public data sets, we would suggest researchers the following: 1. if the GN model works properly on the data set at hand, then use the GN model to correct the background;
2. if the GN model fails, then use our proposed models, particularly the GLNn model. The reason for not choosing the ELN models is that the value of the parameter α from the benchmarking data set is less than 1, around 0.2. Therefore, the gamma model is more appropriate to model the true intensity distribution than the exponential one. We believe that the right approximative computation of the GLN models will lead to a better performance than the current approximation.
The ELN models perform better than the original EN models, due to the fact that not only the regular probes, but also the control probes are skew-distributed ( [4]). Therefore, these models could be the second choice after the GLN, when the GN model does not work.
3. With regard to computation time, on the benchmarking data set the EN models work faster than the others. They are followed by the ELNp, ELNn and EGm; the GLNn and the ENmc are the third fastest, then come the GN and the ELNm, followed by the GLNm as the slowest.
4. As we will show in a subsequent paper, it is possible to develop a new model which satisfies all of the affycomp criteria, the consistency in the background correction and the parametrization errors.
One of the purposes of using microarray technology is finding the genes which are differentially expressed due to some disease or condition. Therefore, it is important to investigate the effect of the bias of the background correction and the parametrization on the detection of differentially expressed genes. This will be our future work.
Appendix A: MA plots

(Figures: variance across replicates against average log expression.)