Simulation Tools for Small Area Estimation : Introducing the R Package saeSim

The demand for reliable regional estimates from sample surveys has substantially grown over the last decades. Small area estimation provides statistical methods to produce reliable predictions when the sample sizes in specific regions are too small to apply direct estimators. Modeland design-based simulations are used to gain insights into the quality of the methods utilized. In this article we present a framework which may help to support the reproducibility of simulation studies in articles and during research. The R package saeSim is adjusted to provide a simulation environment for the special case of small area estimation. The package may allow the prospective researcher during the research process to produce simulation studies with minimal coding effort.


Introduction
The demand for reliable small area statistics from sample surveys has substantially grown over the last decades due to their use in public and private sectors.In this paper we present a framework for simulation studies within the field of small area estimation.This tool might be useful for the prospective researcher or data analyst to provide reproducible research.
Reproducible research has become a widely discussed topic.In the field of statistics many open source tools like the R language (R Core Team 2014) and L A T E X, dynamic reporting packages like knitr (Yihui 2013), Sweave (Leisch 2002) and more recently rmarkdown (Allaire, McPherson, Xie, Wickham, Cheng, and Allen 2014), make the integration of text and source code for statistical analysis possible.Publishing source code and data alongside research results draws special attention to authoring the analysis.However, the requirements for source code are different from the written words in the article itself.
Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.(Knuth 1992, p.99)Besides the combination of text and source code, reproducible research aims at the availability of the full academic research, which is the paper combined with the full computational environment, like data and source code.However, real data are often very sensitive and governed by strict confidentiality rules.Synthetic data generation mechanisms as discussed in Alfons, Kraft, Templ, and Filzmoser (2011) or Kolb (2013) can be used to provide safe data which are publicly available to enable the community to reproduce the analysis and results.Burgard, Kolb, and Münnich (2014) interpreted this as an open research philosophy.Such synthetic data sets can be used to test newly proposed statistical methods in a close-to-reality framework.In general, simulation studies in statistics can be divided into two concepts: Design-based: The simulation study is based on true or synthetic data of a fixed population.Then, samples are selected repeatedly from the underlying finite population and different estimation methods are applied in each replication.The estimates so obtained are compared to the true values of the population, for instance, in terms of relative bias (RB) or relative root mean squared error (RRMSE).
Model-based: The simulation study uses data drawn from certain distributions.In each iteration, the population is generated from a model and a sample is selected according to a specific sampling scheme.The sample is used to estimate the quantity of interest for which quality measures (like RB and RRMSE) are derived.
Further discussion regarding model-and design-based simulations is available in Münnich, Schürle, Bihler, Boonstra, Knotterus, Nieuwenbroek, Haslinger, Laaksonen, Eckmair, Quatember, Wagner, Renfer, Oetliker, and Wiegert (2003), Salvati, Chandra, Giovanna-Ranalli, and Chambers (2010) or Alfons, Templ, and Filzmoser (2010).Alfons et al. (2010) provide the R package simFrame which helps to conduct simulation studies in a reproducible environment.It includes a wide range of features (like data generation, sampling schemes, outlier contamination mechanisms and missing values) to conduct simulation studies.simFrame was originally developed for simulations in the context of survey statistics but is now designed to be as general as possible (cf.Alfons et al. 2010).Furthermore the package simPop (Meindl, Templ, Alfons, Kowarik, and with contributions from Ribatet M 2014) supports the generation of synthetic population data.This can be a suitable environment in scenarios where the reproducibility of results and confidentiality issues play an important role.
Survey statistics are used, for example, in order to deliver specific indicators as a basis for economic and political decision processes.Of special interest here are regional or group specific comparisons (cf.Schmid and Münnich 2014).Surveys which are utilized to report these regional indicators, however, are generally designed for larger areas.Hence sample information on more detailed levels, e.g.municipalities, is hardly available so that classical estimation methods (direct estimators) may lead to high variances of the estimates (cf.Ghosh and Rao 1994).In this case small area estimation methods may reveal highly improved results for the target estimates.Small area estimation has become more and more attractive over the last decade: In 2002, small area estimation (SAE) was flourishing both in research and applications, but my own feeling then was that the topic has been more or less exhausted in terms of research and that it will just turn into a routine application in sample survey practice.As the past 9 years show, I was completely wrong; not only is the research in this area accelerating, but it now involves some of the best known statisticians... Pfeffermann (2013) However, simulation studies in the context of small area estimation are often presented very briefly.Thus there is a need to have a suitable framework to guarantee the reproducibility of analysis.To the best of our knowledge, there is no R package or framework adjusted for the special case of small area estimation which provides a simulation environment.
The aim of this article is to introduce a new R package, saeSim, which supports the process of making simulation studies in the field of small area estimation reproducible.To be more precise, the suggested package has three main objectives: First, to provide tools for data generation.Second, to unify the process of simulation studies.Third, to make the source code of simulation studies available, such that it supports the conducted research in a transparent manner.
This paper is organised as follows.In Section 2 we give a short introduction to small area estimation focusing mainly on unit-level (Battese, Harter, and Fuller 1988) and area-level models (Fay and Herriot 1979).Section 3 introduces a framework for simulation studies and how it is supported by the R package saeSim.To illustrate some of the features of the package we present a model-and design-based simulation study in Section 4. We conclude the paper in Section 5 by summarising the main findings and by providing some avenues for further research.

Small area estimation
The aim of small area estimation is to produce reliable statistics (means, quantiles, proportions, etc.) for domains where few or no sampled units are available.Groups may be areas or other entities defined, for example, by socio-economic characteristics.The demand for such estimators is increasing as they are used for fund allocation, educational and health programs (Pfeffermann 2013).As direct estimation of such statistics are considered to be unreliable (with respect to MSE), methods in small area estimation try to improve the domain predictions by borrowing strength from neighbouring or similar domains.This can be achieved by using additional information from census data or registers to assist the prediction for non-sampled domains or domains with small sample sizes.
For the purpose of this article we will introduce two basic models frequently used in small area estimation, the unit-level model introduced by Battese et al. (1988) and the area-level model introduced by Fay and Herriot (1979).The unit level model (Battese et al. 1988) can be expressed as: where i = 1, . . ., D and j = 1, . . ., n i .The population of size N is divided into D nonoverlapping small areas of sizes N i and into n sampled and N − n non-sampled units, denoted by s and r respectively.y ij is the dependent variable for domain i and unit j, and x ij are the corresponding auxiliary information for that unit.Furthermore v and e are independent.Let β denote the best linear unbiased estimator (BLUE) of β and vi the best linear unbiased predictor (BLUP) of v i (cf.Henderson 1950or Searle 1971).The empirical best linear unbiased predictor (EBLUP) for the mean in small area i in the Battese-Harter-Fuller model is then given by Due to reasons of confidentiality unit-level information is not always available.Instead only aggregates for the domains or direct estimators may be supplied.In this case the feasible direct estimates are known to be unreliable in the case of small sample sizes.Here area-level models can be valuable.The area-level model introduced by Fay and Herriot (1979) is built on a sampling model: where y i is a direct estimator of a statistic of interest µ i for an area i.The sampling error e i is assumed to be independent and normally distributed with known variances σ 2 e,i , i.e. e i |µ i ∼ N (0, σ 2 e,i ).The model assumes a linear relationship between the true area statistic µ i and some auxiliary variables x i : The model errors v i are assumed to be independent and normally distributed, i.e. v i ∼ N (0, σ 2 v ).Furthermore e i and v i are assumed to be independent.Combining the sampling model and the linking model leads to: (2) The Fay-Herriot (FH) model in ( 2) is effectively a random-intercept model where the distribution of the error term e i is heterogeneous and known.The EBLUP of the small area mean in the FH model is given by In this section we will present the simulation framework implemented in saeSim.The framework relies strongly on the idea to describe a simulation as a process of data manipulation.Independent of simulation studies, Wickham and Francois (2015) and Wickham (2014) strongly promote this idea by providing tools for cleaning and transforming data.In those frameworks every defined function takes a data.frameas input and returns it modified.This leads to a natural connection between all defined functions since the result of one function can be directly passed to the next as an argument.The symbioses of these packages with the pipe operator (%>%) from the package magrittr (Bache and Wickham 2014) only emphasizes the process of data manipulation.
To avoid nested function calls the operator can be used to improve the readability as expressions can be read from left to right (cf.Section 4).
In saeSim we extend this approach to simulation studies in the field of small area estimation.The main focus lies in the description of a simulation as a process of data manipulation.Each step in this process can be defined as a self contained component (function) and thus can be easily replaced, extended and most importantly reused.Before we go into the details of the functionality of the package we discuss the process behind simulation studies and how saeSim maps this process into R.
Simulation studies in small area estimation address three different levels; these are the population, the sample and data on aggregated level.Figure 1 illustrates these levels.The left column describes the steps of data manipulation, the right column presents the function names to define the corresponding steps.The population-level defines the data on which a study is conducted and may be a true population, synthetic population data or randomly generated variates from a model.We see two different points of view to define a population: Firstly, design-based simulations, which means that a simulation study is based on true or synthetic data of one population.Secondly, model-based simulations, which have changing random populations drawn from a model.
The scope of this article is not to opt for viewpoints.The aim is to incorporate the different simulation concepts in a common framework.The base (first component in Figure 1) of a simulation study is a data table; here the question is whether these data are fixed or random over simulations.Or from a more technical point of view, is the data generation (the second step in Figure 1) repeated in each simulation run or omitted.Depending on the choice of a fixed or random population it is necessary to re-compute the population domain-statistics like domain means and variances, or other statistics of interest (third component in Figure 1).
The sample-level is necessary when domain predictions are conducted for unit-level models.Independently of how the population is treated -whether as fixed or random -this phase consists of two steps: Firstly, drawing a sample according to a specific sampling scheme.
Secondly, conducting computations on the samples (fourth and fifth component in Figure 1).Given a sample, small area methods are applied.Of interest are, for instance, estimated model parameters, domain predictions or measures of uncertainty (MSE) for the estimates.
Since the sample-level is necessary when unit-level models are applied, the aggregate-level is conducted when area-level models are applied (the seventh and last component in Figure 1).Area-level models in small area estimation typically only use information available for domains (in contrast to units).Thus the question for simulation studies for area-level methods is whether the data are generated on unit-level and is used after the aggregation (sixth component in Figure 1) or whether the data are generated directly on area-level, i.e. drawn from an area-level model.Depending on whether or not unit-level data and sampling are part of the simulation process, the aggregate-level follows the generation of the population or is based on the aggregated sample.
Depending on the topic of research, some steps in this simulation framework can be more relevant than others.From our perspective, these steps are more a complete list of phases one can conduct.Single components may be omitted if not relevant in specific applications.
For example data generation is not relevant if you have population data, or the sample-level is not used, when the sample is directly drawn from the model.
Seen this way, saeSim maps the different steps into R. Two layers with separate responsibilities need to be discussed.The first is how different simulation components can be combined, and the second is when they are applied.Regarding the first, in saeSim we put a special emphasis on the interface of each component.To be precise, we use functions which take a data.frameas argument and have a data.frameas return value.The return value of one component is the input of the next.This definition of interfaces is used for all existing tools in saeSim.The second column in Figure 1 shows how the different steps in a simulation can be accessed.It is important to note that the functions in Figure 1 control the process, the second layer, i.e. when components are applied.Each of these functions take a simulation setup object to be modified and a function with the discussed interface as arguments.Hence, the pipe operator (%>%) can be used to combine separate components to a simulation setup.

Case studies
We present two applications of saeSim, one model-based simulation in Section 4.1 and a design-based simulation in Section 4.2.First, though, we introduce some basic functionalities as the pipe operator (%>%) needs some explanation.The pipe operator is designed to make otherwise nested expressions more readable as a line can be read from left to right, instead from inside out (Bache and Wickham 2014).As a simple example see the following lines which are equivalent with respect to their functionality: > library("magrittr") > colMeans(matrix(rnorm(10), ncol = 2)) > rnorm(10) %>% matrix(ncol = 2) %>% colMeans In saeSim, we rely on this operator.Although all functions can be used without it, we strongly recommend its usage.The following example shows some of the aspects of the package: > library("saeSim") > setup1 <-sim_base_lm() %>% sim_sample(sample_number( 5)) > setup2 <-sim_base_lm() %>% sim_sample(sample_fraction(0.05)) Without knowing anything about the setup defined in sim_base_lm() we notice that setup1 and setup2 only differ in the applied sampling scheme.sim_sample() is responsible as a control when a function is applied (after the population-level) and sample_number(5) and sample_fraction(0.05)define the explicit way of drawing samples.Separating the responsibility of each component into what is applied and when it is applied makes it possible to add new components to any step in the process.The composition of a simulation in that manner will focus on the definition of components and hide control structures.Any function can be passed to sim_sample() which has a data.frameboth as input and as return value.
The only responsibility of that function is to draw a sample, which makes it easy to find, understand and reuse when published.The operator %>% is used to add new components to the setup.

Model-based simulation
In the following we show one way how to construct a simulation in a model-based setting.
The aim is to estimate the domain predictions under a FH model.Involved components are data generation and computing on aggregated data (cf.Figure 1).The first step is to generate the data under the model: where ) and e i indep ∼ N (0, σ 2 i ) with σ 2 i = 0.1, 0.2, . . ., 4 and i = 1, . . ., 40 as index for the domains.x i , v i and e i are independent from each other.The area-level data for the simulation are generated in each Monte Carlo replication.
In this case the base-component is a data frame with an id variable named idD and constructed with the function base_id().Any random number generator in R can be used.However, we have normally distributed variates, for which some predefined functions are available in the package.For the reproducibility of the following results we also set the seed to 1.The seed is not part of a simulation setup in saeSim but needs to be defined by the researcher.
> set.seed(1)> setup <-base_id(nDomains = 40, nUnits = 1) %>% + sim_gen_x(mean = 0, sd = 4) %>% + sim_gen_v(mean = 0, sd = 1) > setup To add the area-level predictions from a Fay-Herriot model we need to define another component.The function takes a data.frameas input and returns the modified version.For the estimation of the EBLUP under the FH model we use the function eblupFH() from the package sae (Molina and Marhuenda 2013).To avoid naming conflicts between the dependencies of sae and the package dplyr (Wickham and Francois 2015) we make use of the double colon operator in order to call the function eblupFH() without attaching the package.Hence we define a function named comp_FH() and add it to the process: An additional variable idR is automatically added as an ID-variable for the iteration as well as a variable simName to distinguish between scenarios.In saeSim we do not provide further tools to process the resulting data as there are many tools readily available in R. In the design-based scenario we show how to process the result data into graphs with only a few lines of code.Using this data set as population, we repeatedly draw samples from it.Then we predict the domain means by using a direct estimator and a unit-level model.The sampling design is to draw a 10 per cent sample from each region with simple random sampling.For each region the direct estimator for income and the EBLUP under the BHF model is computed.Although the data offer some more information, we only use gender, age and agesq as covariates.The function eblupBHF() from the package sae is an implementation of the BHF estimator.This function takes three data objects and returns the domain predictions.The three objects are the sampled data, the population means of the auxiliary variables and the population sizes in each domain.

Design-based simulation
Before we begin to construct the simulation setup, we store these data frames as attributes to the population data.This allows us to process meta data alongside the main data frame.It is important to note that not all functions for manipulating data frames in R preserve these attributes.Users of saeSim have to keep this in mind when they implement new functions.
Defining the interfaces between components differently is one possibility to avoid the usage of attributes.This can be done, for example, by using generic vectors or S4 classes instead of data frames.However this will add complexity to the process of data manipulation underlying the package which we try to avoid by following the paradigm: data frame in, data frame out.Thus all functions in saeSim preserve the attributes of the main data frame.

Outlook
From our perspective, there is a need for sharing tools for data generation and simulation among the scientific community in order to guarantee the reproducibility of research.saeSim may provide an adequate framework for pursuing this aim in the field of small area estimation.By defining the steps of a simulation we may promote a reasonable way to communicate results in academic articles and during research.
The package source is available on CRAN (http://CRAN.R-project.org/package=saeSim) and the repository for development on GitHub (https://github.com/wahani/saeSim).As GitHub allows to share and contribute source code using version control, it is open for submissions.Apart from the availability of specific utility functions, we may promote and support the design of source code for simulation studies.One aspect is the design of simulations as processes of data.Furthermore, we encourage the definition of small and self contained components, i.e. functions.This reduces the lines of code necessary to be read in order to understand its purpose.
The package provides more features than are introduced in this article.One may mention in this context the support of outlier contaminated data.Currently only representative outliers (outlying observations in the population) are supported (cf.Chambers 1986).However, we plan to extend this feature to non-representative outliers (outliers are part of the sample but not the population, e.g., incorrectly recorded values).Furthermore, an interface to random number generators in R is available.The user can also generate group effects as needed in mixed models.As the response is created by an R expression, any form of non-linearity in the relationship between response and auxiliary variables as well as error components can be modeled.
A more technical feature is a back-end for parallel computations which is a link to the paral-lelMap package in R (Bischl and Lang 2015).Tools to process result data after the simulation, i.e. summaries or plotting methods, are avenues for further research.Already available are some simple plots for the simulation setups as well as a summary method to get information on the expected run time and structure of the resulting data.

Figure 1 :
Figure 1: Process of simulation.Left column are the steps in a simulation.Right column are the corresponding function names to represent those steps in R.

Figure 3 :
Figure 3: Monte Carlo MSE of direct vs.BHF predictor.50 predictions for each region and estimation technique.
Note that if you print a simulation setup to the console, as in the above example, one simulation run is performed and only the first rows (the head) of the resulting data table are printed.This enables interactivity with the object itself; however, it hides the fact that the setup object is a collection of functions to be called.In this model the error component e i has different variances which is not covered by a predefined function.Thus, as a generator component, we define a function which takes a data.frameasinput and returns it after adding a variable named vardir with the variances and the variable e with the generated random numbers:The last step in data generation is to construct the response variable which is named y and is added to the data.Furthermore, we add the true area statistic under the model to the data: The object setup stores all necessary information to run one iteration of the simulation.In the following R = 100 repetitions are performed.The result is a list of data.frames.The function bind_rows() from the package dplyr is used to combine the resulting list: (Alfons et al. 2010)simulation we illustrate the use of saeSim under a fixed population.For this purpose we use a synthetic population generated from Austrian EU-SILC (European Union Statistics on Income and Living Conditions) data.The data consist of 25 thousand households.It is published alongside the R package simFrame(Alfons et al. 2010)where it is also used as an example data set.To keep this study as simple as possible, we further restrict the data to the main income holder and only use some of the available auxiliary information.
(Wickham 2007)cess the simulation results we present two plots using the package ggplot2(Wickham 2009) and reshape2(Wickham 2007)for further reshaping of the data.Figure2and 3 show the Monte Carlo BIAS and MSE for each region in Austria and for each estimator.Keep in mind that the BHF was applied to illustrate the simulation framework.There are a number of issues with regard to model choice, variable selection and outliers, which we will not discuss in this context.