\documentclass[article]{ajs}
\usepackage{amssymb, amsmath}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%% almost as usual
\author{J\"org Drechsler\\Institute for Employment Research \And
        Hans Kiesl\\OTH Regensburg \And
        Matthias Speidel\\Institute for Employment Research}
\title{MI Double Feature: Multiple Imputation to Address Nonresponse and Rounding Errors in Income Questions}

%% for pretty printing and a nice hypersummary also set:
\Plainauthor{J\"org Drechsler, Hans Kiesl, Matthias Speidel} %% comma-separated
\Plaintitle{MI Double Feature: Multiple Imputation to Address Nonresponse and Rounding Errors in Income Questions} %% without formatting
\Shorttitle{MI Double Feature} %% a short title (if necessary)

%% an abstract and keywords
\Abstract{
  Obtaining reliable income information in surveys is difficult for two reasons. On the one hand, many survey respondents consider income to be sensitive information and are thus reluctant to answer questions regarding their income. If those survey participants who do not provide information on their income are systematically different from the respondents (and there is ample research indicating that they are), results based only on the observed income values will be misleading. On the other hand, respondents tend to round their income. This second source of error in particular is usually ignored when analyzing the income information.\\
In a recent paper, \citet{dre:kie:2014} illustrated that inferences based on the collected information can be biased if the rounding is ignored and suggested a multiple imputation strategy to account for the rounding in reported income. In this paper we extend their approach to also address the nonresponse problem. We illustrate the approach using the household income variable from the German panel study ``Labor Market and Social Security''.
}
\Keywords{heaping, measurement error, multiple imputation, nonresponse, poverty rate}
%\Plainkeywords{keywords, comma-separated, not capitalized, R} %% without formatting
%% at least one keyword must be supplied

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{50}
%% \Issue{9}
%% \Month{June}
%% \Year{2012}
%% \Submitdate{2012-06-04}
%% \Acceptdate{2012-06-04}
%% \setcounter{page}{1}
\Pages{1--xx}

%% The address of (at least) one author should be given
%% in the following format:
\Address{
 J\"org Drechsler\\
  Department for Statistical Methods\\
Institute for Employment Research\\
  Regensburger Str. 104, 90478 Nuremberg, Germany\\
  E-mail: \email{joerg.drechsler@iab.de}
}
%% It is also possible to add a telephone and fax number
%% before the e-mail in the following format:
%% Telephone: +43/512/507-7103
%% Fax: +43/512/507-2851

%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\begin{document}

%% include your article here, just as usual

\section{Introduction}
Reliable information on individual and household income is difficult to obtain. Most administrative data sources contain only specific sources of income such as income from earnings or program participation and often only cover a subset of the population (self-employed are usually not included). Thus, most agencies rely on household surveys to collect information on total income. However, inferences based on the collected income information might be biased for two reasons: First, income is considered sensitive information and many survey participants are reluctant to answer questions on their personal income. Second, most respondents do not remember their exact income, especially if they are asked to provide an estimate for their total income including income from earnings, assets, transfers, etc. Respondents often round their income in this case, implicitly incorporating their uncertainty regarding the true value.

Nonresponse can bias inferences if the respondents are systematically different from the nonrespondents. For example, it seems plausible to assume that younger survey respondents are less concerned with confidentiality violations and the protection of sensitive information (``generation Facebook'') and thus, their response rates to income questions will be higher. Since income usually increases with age, individuals with lower income will be over-represented among the respondents in this case and estimates of the average income of the population will be underestimated if only the observed income values are used.

To reduce the risk of nonresponse bias, many surveys try to obtain at least partial income information for those survey participants that are unwilling or unable to provide exact income information by asking whether the income lies in certain pre-specified intervals. Often subsequent questions further narrow down the interval in which the true income falls. Figure \ref{fig:inc_tree} provides an example of how (partial) income information is collected in the German panel study ``Labor Market and Social Security'' (PASS) \citep{tra:2010}. Respondents are first asked for an estimate of their total household income. If they are unwilling or unable to provide this information, the interviewer provides a first threshold (1,000 euros) and asks whether the income is above or below that threshold. Depending on the answer to this question, the survey participant is asked to choose from three specific intervals (if the respondent reported an income below 1,000 euros for the first question) or a new threshold (3,000 euros) is provided and the respondent is asked again whether his or her income is above or below this threshold. If the respondent provides an answer to the second threshold question, three different income intervals are offered for both response options and the respondent is asked to pick the interval in which his or her income falls. Figure \ref{fig:inc_tree} illustrates the decision steps and the corresponding income intervals that are implied by the responses to each of the questions. The interview process could terminate in any of the nodes of the decision tree. For example, a respondent might refuse to provide the exact income information but might be willing to provide the information that his or her income is larger than 1,000 but less than 3,000 euros. However, he or she might be unwilling to further specify whether the income is in the interval  $[1000,~1500[$ or $[1500, ~2000[$ or $[2000, ~3000[$.

	\begin{figure}
 			\includegraphics[width=\textwidth]{Questiontree_07.pdf}
				\caption{Implied income intervals based on partial income information collected from respondents unwilling to provide their exact income.}
				\label{fig:inc_tree}
		\end{figure}
Asking those respondents that are unwilling to provide their exact income for information regarding the interval in which their income falls is a successful strategy to reduce the nonresponse rate. For example, in wave six of the PASS survey, $76.96\%$ of the respondents who are unwilling or unable to provide their exact income provided some information on the interval in which their income falls, reducing the initial nonresponse rate from $4.56\%$ to $1.05\%$.
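
As a quick check of these two figures: if $76.96\%$ of the initial $4.56\%$ of nonrespondents provide interval information, the share of households without any income information is
\[
4.56\% \times (1 - 0.7696) = 4.56\% \times 0.2304 \approx 1.05\%.
\]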

Following this procedure, the collected income information consists of exact information for those respondents that are willing to answer the exact income question and interval information of different lengths for those individuals that answer (some of) the interval questions. Directly obtaining valid inferences from this type of data is not straightforward, especially if refusal to answer any of the income questions should also be taken into account. In this paper we will present an imputation approach that simplifies the analysis of the collected income data. The multiple imputation methodology is not only used to impute the missing values; plausible exact income values are also generated for those respondents that only provided interval information regarding their income. The obtained imputed income data can be analyzed as if the exact income had been obtained for all respondents. The additional uncertainty implied by the fact that only partial information is available for some of the respondents is correctly reflected through the multiple imputation procedure.

The negative effects of nonresponse are well known. However, the impacts of heaping, i.e., rounding to certain numbers such as multiples of 5, 10, 100, etc., are less studied. Rounding is a common phenomenon in surveys. Most quantitative variables such as questions on expenditure or subjective beliefs (\textit{How likely is it that...}) show some form of rounding \citep{man:mol:2010}. Questions on the timing of events \citep{hut:1990} or on smoking behavior \citep{wan:2008} are typically affected as well. In a recent experimental study, \citet{ruu:2013} demonstrated that the amount of rounding increases with the level of uncertainty the respondent feels regarding the quantity he or she is asked for. Regarding questions on income, the level of uncertainty is usually very high. Most respondents do not know their income from earnings to the exact euro amount (especially if the earnings before taxes are requested) and exact values for other sources such as monthly income from savings are even more difficult to provide. Thus, it is not surprising that questions on income usually show a large degree of rounding. Table \ref{PASS_round} provides the percentage of the reported monthly income values that are divisible by a given round number obtained from the PASS survey for the year 2008/2009 (see Section \ref{PASS_describe} for a description of the survey). It seems that most of the reported data are rounded to some extent. More than 60\% of the reported income values are divisible by 100 and only about 16\% of the data are not divisible by 5.
\begin{table}
\caption{\label{PASS_round}Percentage of reported monthly household income values that are divisible by a given round
number in the PASS survey for the year 2008/2009.}
\centering
\begin{tabular}{lrrrrrr}
   \hline\noalign{\smallskip}
Income divisible by	&   1,000&	500&	100&	50&	10&	5\\
Relative frequency (\%)&13.97&	23.94&	61.57&	69.58&	80.71&	84.13\\
   \hline\noalign{\smallskip}
\end{tabular}
\end{table}
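
The shares in Table \ref{PASS_round} are simple divisibility counts. The following minimal sketch shows how such shares could be computed; the function name \verb|divisibility_shares| and the ten income values are hypothetical illustrations and are not PASS data.

```python
def divisibility_shares(incomes, bases=(1000, 500, 100, 50, 10, 5)):
    """Percentage of reported incomes divisible by each round number."""
    n = len(incomes)
    return {b: 100.0 * sum(1 for y in incomes if y % b == 0) / n for b in bases}


# Hypothetical reported monthly household incomes in euros (not PASS data).
reported = [1000, 1500, 1200, 850, 2300, 1437, 990, 2000, 1755, 640]
shares = divisibility_shares(reported)
for base, share in shares.items():
    print(f"divisible by {base}: {share:.1f}%")
```

Note that the categories are nested: every value divisible by 1,000 is also counted as divisible by 500, 100, and so on, which is why the percentages in the table increase from left to right.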

\citet{dre:kie:2014} illustrate that heaping in income data can cause substantial bias in important measures such as the poverty rate. They also suggest a strategy for dealing with the problem and demonstrate its merits through simulations and real data applications. The basic idea is to model the rounding behaviour given the reported income value and then to replace the reported value by multiple plausible candidates for the true value that would have been observed if the respondent had not rounded his or her income. A related idea has been proposed by \citet{hei:rub:1990} for heaped age data and has later been applied in a number of papers to model the smoking behaviour based on reported cigarette counts \citep{hei:1994,wan:2008,wang:2012}. The major advantage of the approach is that the imputed values can be treated as true values in any analysis following the imputation, i.e., it is not necessary to develop adjustment methods for each type of analysis separately. The analyst only needs to repeat the analysis of interest on each imputed dataset using standard analysis techniques. The final inferences are obtained using standard multiple imputation combining rules \citep{rub:78,rub:87}.
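
For readers unfamiliar with the combining rules, the following sketch illustrates them for a scalar estimand: the point estimate is the average of the $m$ per-dataset estimates, and the total variance is the average within-imputation variance plus $(1+1/m)$ times the between-imputation variance. The function name \verb|combine_rubin| and the three example estimates are hypothetical.

```python
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Combine m point estimates and their variance estimates.

    Returns the pooled point estimate and the total variance
    T = W + (1 + 1/m) * B, with W the mean within-imputation variance
    and B the sample variance of the m point estimates.
    """
    m = len(estimates)
    q_bar = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # divides by m - 1
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


# Hypothetical poverty rate estimates (in %) from m = 3 imputed datasets.
q_bar, total_var = combine_rubin([16.2, 16.5, 16.8], [0.36, 0.38, 0.40])
print(q_bar, total_var)
```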

In this paper we extend the approach by \citet{dre:kie:2014} in order to address (partial) nonresponse and heaping simultaneously. We review the approach of \citet{dre:kie:2014} in Section \ref{round} and discuss the necessary extensions to incorporate the interval information and to adjust for nonresponse in Section \ref{simple_imp}. In Section \ref{PASS_describe} we illustrate the approach based on data from the German panel study ``Labor Market and Social Security''. The paper concludes with some final remarks.

\section{Strategies to adjust for rounding errors}\label{round}
This section discusses the imputation approach suggested by \citet{dre:kie:2014}, which itself is based on an idea by \citet{hei:rub:1990}. In their paper, \citet{hei:rub:1990} proposed using multiple imputation to correct for heaped reported age values of young children in Tanzania. The section borrows heavily from \citet{dre:kie:2014} and we refer the reader to this paper for a more detailed discussion of the methodology.

To obtain imputed income values that are adjusted for potential rounding, we need two models: one for the true income and one for the rounding behaviour. Following common practice, we model the distribution of the household income $Y$ by a log-normal distribution (see, for example, \citet{cle:gal:2005} for a motivation for this model):
\begin{equation}\label{logN}
	\log(Y) |X \sim N(X'\beta,~\sigma^2).
\end{equation}
We only consider rounding to the nearest multiple of $c$, which corresponds to the rounding function $f_c: x\mapsto c \cdot \lfloor x/c + 1/2\rfloor $ and which we call rounding of degree $c$. No rounding at all will be called rounding of degree 0. We assume that there are $p$ possible degrees of rounding $c_1 < ... < c_p$. Typically, the set of $c_i$'s consists of values such as 0, 1, 5, 10, 50, 100. For a given household, our model for the degree of rounding is an ordered probit model, i.e., we assume a normally distributed latent variable $G$ which may (linearly) depend on the logged income $\log(Y)$ and some covariates $Z$ (where some or all components of $Z$ might be in $X$ and vice versa):
\begin{equation*}
G | \log(Y), Z  \sim N(\gamma_0 + \gamma_1 \cdot \log(Y) + Z' \gamma_2, ~\tau^2)
\end{equation*}
Rounding of degree $c_1$ occurs if $G < k_1$; rounding of degree $c_i$ $(1<i<p)$ occurs if $G \in [k_{i-1}, k_{i}[$; rounding of degree $c_p$ occurs if $G \geq k_{p-1}$. The $p-1$ threshold values $k_1 < k_2 <\ldots<k_{p-1}$ are unknown model parameters.
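
The two building blocks of this model, the rounding function $f_c$ and the mapping from the latent variable $G$ to a rounding degree, can be sketched as follows. The function names and the example thresholds are hypothetical and only serve to illustrate the definitions above.

```python
import math


def round_degree(x, c):
    """Rounding of degree c: f_c(x) = c * floor(x / c + 1/2).

    Degree 0 denotes no rounding at all.
    """
    if c == 0:
        return x
    return c * math.floor(x / c + 0.5)


def degree_from_latent(g, thresholds, degrees):
    """Map the latent variable G to a rounding degree.

    thresholds holds k_1 < ... < k_{p-1}; degrees holds c_1 < ... < c_p.
    Degree c_1 applies if g < k_1, c_i if k_{i-1} <= g < k_i,
    and c_p if g >= k_{p-1}.
    """
    if len(thresholds) != len(degrees) - 1:
        raise ValueError("need exactly p - 1 thresholds for p degrees")
    for k, c in zip(thresholds, degrees):
        if g < k:
            return c
    return degrees[-1]


# Hypothetical thresholds for the degrees 1, 10, 100, 1000.
print(round_degree(837.4, 50))                                   # 850
print(degree_from_latent(0.0, [-1.0, 0.0, 1.0], [1, 10, 100, 1000]))  # 100
```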

We assume that given $X$, $\log(Y)$ and $Z$ are independent, and analogously, given $\log(Y)$ and $Z$, $G$ and $X$ are independent. Thus, $\log(Y)$ and $G$ have a bivariate normal distribution given $X$ and $Z$:

\[
\log(Y),G | X,Z\sim N(\mu,~\Omega), \quad \text{where}
\]
\begin{eqnarray}
 \mu&=&\left(\begin{array}{c}
                                    X'\beta\\
                                    \gamma_0+\gamma_1 X'\beta+Z'\gamma_2\end{array}\right), \label{biv1}\\
 \Omega&=&\left(\begin{array}{cc}
                                \sigma^2 &\gamma_1\sigma^2\\
                               \gamma_1\sigma^2&\tau^2+\gamma_1^2\sigma^2
                              \end{array}\right). \label{biv2}
\end{eqnarray}
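
As a small illustration of (\ref{biv1}) and (\ref{biv2}), the following sketch assembles $\mu$ and $\Omega$ for a single household. The function name \verb|joint_moments| and all numerical values are hypothetical; the defaults $\gamma_0=0$ and $\tau^2=1$ anticipate the identifying restrictions of the ordered probit model.

```python
def joint_moments(x, beta, z, gamma2, gamma1, sigma2, gamma0=0.0, tau2=1.0):
    """Mean vector and covariance matrix of (log(Y), G) given X and Z."""
    xb = sum(xj * bj for xj, bj in zip(x, beta))
    zg = sum(zj * gj for zj, gj in zip(z, gamma2))
    mu = (xb, gamma0 + gamma1 * xb + zg)
    cov = gamma1 * sigma2
    omega = ((sigma2, cov), (cov, tau2 + gamma1 ** 2 * sigma2))
    return mu, omega


# Hypothetical covariates and parameters for one household.
mu, omega = joint_moments(x=[1.0, 2.0], beta=[7.0, 0.1], z=[1.0],
                          gamma2=[0.5], gamma1=0.2, sigma2=0.25)
print(mu)
print(omega)
```

The off-diagonal element $\gamma_1\sigma^2$ shows that $\log(Y)$ and $G$ are correlated only through $\gamma_1$: if $\gamma_1=0$, the tendency to round does not depend on the income.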

To impute true income values based on these models, it is necessary to derive the likelihood for all the unknown parameters $\Psi=(\beta,~\sigma^2,~\gamma_1,~\gamma_2, ~k_1, ~..., ~k_{p-1})$ (we need to fix $\gamma_0$ at $0$ and $\tau^2$ at $1$ to make the ordered probit model identifiable). Let $s_i$ be the observed income of household $i$. It can be shown that this likelihood is given as (see \citet{dre:kie:2014} for details)
\begin{eqnarray}\label{lik}
\notag L(\Psi|s,~x,~z)&=&\prod_{i} f(s_i,~x_i,~z_i|\Psi)\\
       &=&\prod_{i}f(x_i,~z_i) \cdot \prod_{i}f(s_i |x_i,~z_i, ~\Psi)  \\
\notag       &\propto & \prod_i \iint\limits_{A(s_i)} f(g,~\log(y)|x_i,~z_i,~\Psi) d\log(y)dg,
\end{eqnarray}
where $A(s_i)$ is the set of $(g, ~\log(y))$ that are consistent with an observed $s_i$.
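
To make the set $A(s_i)$ concrete, the following sketch evaluates the likelihood contribution of a single reported value under our reading of the rounding model: for every degree $c_j$ that divides $s_i$, the consistent region is $\log(y)\in[\log(s_i-c_j/2),~\log(s_i+c_j/2)[$ and $g\in[k_{j-1},~k_j[$. The double integral is reduced to a one-dimensional integral by conditioning on $\log(y)$ and is approximated with a midpoint rule. This is an illustrative sketch, not the implementation of \citet{dre:kie:2014}; the function name, the grid size, and all parameter values are assumptions, and only degrees $c_j\geq 1$ with $\tau^2=1$ and $\gamma_0=0$ are handled.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def contribution(s, mu, sigma, gamma1, z_gamma2, thresholds, degrees, n_grid=400):
    """Approximate P(observing s) for one household (s a positive integer).

    mu, sigma: mean and standard deviation of log(Y) given X.
    gamma1, z_gamma2: coefficient of log(Y) and the value Z'gamma_2 in the
    mean of G (tau^2 = 1, gamma_0 = 0).
    """
    income = NormalDist(mu, sigma)
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    total = 0.0
    for j, c in enumerate(degrees):
        if s % c != 0:
            continue  # degree c cannot have produced the value s
        lo, hi = math.log(s - c / 2), math.log(s + c / 2)
        step = (hi - lo) / n_grid
        for t in range(n_grid):
            log_y = lo + (t + 0.5) * step
            mean_g = gamma1 * log_y + z_gamma2
            # P(k_{j-1} <= G < k_j | log(y)) with unit variance
            p_g = STD_NORMAL.cdf(cuts[j + 1] - mean_g) - STD_NORMAL.cdf(cuts[j] - mean_g)
            total += income.pdf(log_y) * p_g * step
    return total


# Hypothetical parameter values.
full = contribution(850, math.log(900), 0.5, 0.1, 0.0,
                    [-1.5, -1.0, -0.5, 0.5, 1.5, 2.5],
                    [1, 5, 10, 50, 100, 500, 1000])
print(full)
```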

Maximizing this likelihood provides the estimate $\hat{\Psi}_{ML}$ of the parameter vector $\Psi$ needed for the imputations. To approximate a draw from the posterior distribution $f(\Psi|s,~x,~z)$ under the assumption of flat priors for all parameters, we can draw from
\[
\Psi^{*}\sim MVN(\hat{\Psi}_{ML},~I(\hat{\Psi}_{ML})),
\]
where $\hat{\Psi}_{ML}$ contains the maximum likelihood estimates of $\Psi$, and $I(\hat{\Psi}_{ML})$ is the negative inverse of the Hessian matrix of the log-likelihood with $\hat{\Psi}_{ML}$ plugged in.\\
To impute exact income values, \citet{dre:kie:2014} suggest a simple rejection sampling approach:
\begin{enumerate}
\item
Draw candidate values for $(\log(y_i)^{imp},~g_i)$ from a truncated bivariate normal distribution with mean vector (\ref{biv1}) and covariance matrix (\ref{biv2}) (using parameters from $\Psi^{*}$), where the truncation points are given by the maximal possible degree of rounding given the observed income $s_i$ (for example, for an observed income value $850$ with possible degrees of rounding 1, 5, 10, 50, 100, 500, and 1000, $\log(y_i)$ is bounded by $\log(825)$ and $\log(875)$ and $g_i$ has to be in  $]{-\infty}, ~k_4^{*}[$).
\item
Accept the drawn values if they are consistent with the observed rounded income, i.e., rounding the drawn income value according to the drawn rounding indicator gives the observed income $s_i$, and impute $\exp(\log(y_i)^{imp})$ as the exact income value.
\item
Otherwise draw again.
\end{enumerate}
Repeating this procedure $m$ times provides $m$ imputed datasets that properly reflect the uncertainty from imputation.
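
The three steps above can be sketched for a single household as follows. This is a simplified variant under stated assumptions, not the authors' code: $\log(y_i)$ is drawn from its marginal normal distribution truncated to the widest interval compatible with $s_i$, $g_i$ is drawn from its conditional distribution given $\log(y_i)$ (with $\tau^2=1$, $\gamma_0=0$), and the restriction on $g_i$ is enforced through the acceptance check rather than at the proposal stage. All names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist

DEGREES = [1, 5, 10, 50, 100, 500, 1000]
THRESHOLDS = [-1.5, -1.0, -0.5, 0.5, 1.5, 2.5]  # hypothetical k_1, ..., k_6


def round_to(y, c):
    return c * math.floor(y / c + 0.5)


def degree_from_latent(g, thresholds):
    for k, c in zip(thresholds, DEGREES):
        if g < k:
            return c
    return DEGREES[-1]


def impute_exact(s, mu, sigma, gamma1, z_gamma2, thresholds, rng, max_tries=100000):
    """Draw one plausible exact income for the reported (integer) value s."""
    c_max = max(c for c in DEGREES if s % c == 0)  # maximal possible degree
    marginal = NormalDist(mu, sigma)
    p_lo = marginal.cdf(math.log(s - c_max / 2))
    p_hi = marginal.cdf(math.log(s + c_max / 2))
    for _ in range(max_tries):
        # Step 1: candidate values for log(y) and g.
        log_y = marginal.inv_cdf(rng.uniform(p_lo, p_hi))
        g = rng.gauss(gamma1 * log_y + z_gamma2, 1.0)
        # Step 2: accept if re-rounding the candidate reproduces s.
        if round_to(math.exp(log_y), degree_from_latent(g, thresholds)) == s:
            return math.exp(log_y)
        # Step 3: otherwise draw again.
    raise RuntimeError("no consistent draw found")


rng = random.Random(1)
y_imp = impute_exact(850, math.log(900), 0.5, 0.1, 0.0, THRESHOLDS, rng)
print(y_imp)
```

For the reported value 850, any accepted draw lies between 825 and 875, and draws far from 850 are only accepted together with a latent value $g_i$ that implies rounding to the nearest 50.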

\section{Extensions for (partial) nonresponse}\label{simple_imp}
As discussed in the introduction, many agencies ask respondents who refuse to answer the exact income question whether they would be willing to provide information in which given interval their income falls. This partial information can be used to improve the inferences regarding the income variable. In this paper we suggest using this partial information when setting up the likelihood and then imputing plausible true income values for each reported income interval. The approach is related to the approach to account for rounding described in the previous section, with the only difference that the interval in which the true income must fall is known in advance and does not need to be estimated from the observed data.

Let $r_i$, $r_i\in \{0,~1,~...,~R+1\}$, be a random variable that identifies to which income response group individual $i$, $i=1,...,n$, belongs. Let $r_i=0$ represent exact income information (which might still be affected by rounding) and let $r_i=1,~...,~R$ identify the $R$ different income intervals that could be selected from the predefined intervals provided by the agency. For example, according to Figure \ref{fig:inc_tree} $R=13$ in the PASS survey. Finally, let $r_i=R+1$ represent refusal to provide any income information at all. Let $I^r_i$ be an indicator function that equals 1 if individual $i$ belongs to income response group $r$ and equals 0 otherwise. Let $l^r$ and $u^r$ be the lower and upper bound of the income interval for response group $r$. We set $l^0=y=u^0$ and $l^{R+1}=-\infty$ and $u^{R+1}=+\infty$. All other bounds are defined by the income intervals provided by the agency. We extend the definition of $s_i$ to also include all reported income intervals, i.e., $s_i$ is a single value for all individuals that reported the exact income, but is an interval for all individuals that only provided the information in which interval their income falls. The extended likelihood that also takes the interval information into account is given by

\begin{eqnarray}\label{eq:extLogLi}
 L(\Psi|s,~x,~z)&=&\prod_{i}f(x_i, ~z_i) \cdot \prod_{i}f(s_i |x_i, ~z_i, ~\Psi)  \\
\notag       &\propto & \prod_i \{(\iint\limits_{A(s_i)} f(g, ~\log(y)|x_i, ~z_i, ~\Psi) d\log(y)dg)^{I^0_i}\\
\notag       &\cdot&\prod_{r=1}^{R+1} [F(\log(u^r_i), ~\mu_i=x_i'\beta, ~\sigma^2)-F(\log(l^r_i),~\mu_i=x_i'\beta, ~\sigma^2)]^{I^r_i}\}.
\end{eqnarray}
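
The interval terms in (\ref{eq:extLogLi}) are differences of normal distribution functions evaluated at the logged interval bounds. A minimal sketch of the log-likelihood contribution of one interval respondent is given below. The function name is hypothetical, and treating a lower bound of zero (or below) as $F=0$ and an infinite upper bound as $F=1$ is our assumption for handling open-ended intervals and complete refusals.

```python
import math
from statistics import NormalDist


def interval_loglik(lower, upper, mu, sigma):
    """log[F(log(upper)) - F(log(lower))] for log(Y) ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    f_hi = 1.0 if upper == math.inf else dist.cdf(math.log(upper))
    f_lo = 0.0 if lower <= 0 else dist.cdf(math.log(lower))
    return math.log(f_hi - f_lo)


# Hypothetical household with x_i'beta = log(1400) and sigma = 0.6.
mu_i, sigma_i = math.log(1400), 0.6
print(interval_loglik(1000, 3000, mu_i, sigma_i))   # interval respondent
print(interval_loglik(0, math.inf, mu_i, sigma_i))  # complete refusal: 0.0
```

A complete refusal contributes a factor of one to the likelihood, so these households influence the parameter estimates only through the imputation step.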

Once estimates for all parameters are obtained by maximizing the likelihood in (\ref{eq:extLogLi}), imputation of the plausible values for the true income $Y$ is straightforward. The first imputation step is similar to Section \ref{round}: Approximate a draw from the posterior distribution of the parameters by drawing from a multivariate normal with mean equal to the maximum likelihood estimates of the parameters and variance equal to the negative inverse of the Hessian matrix of the log-likelihood. The second step depends on the type of data that is imputed.
The true income for all exact reporters is imputed as described in Section \ref{round} to account for potential rounding in the reported income values. The true income for the interval respondents is imputed by drawing the log income from a truncated normal distribution $N_t(\mu,\sigma^2)$ with $\mu=X'\beta^{*}$, $\sigma^{2}=(\sigma^{*})^2$, where $\beta^*$ and $(\sigma^*)^2$ are the drawn parameters from step one. The truncation points are given by the logarithms of the bounds of the reported income interval, and the drawn value is transformed back to the income scale. Finally, imputations for those respondents that refused to provide any information regarding their income are obtained in the same way by drawing the log income from an untruncated normal distribution with parameters $\mu=X'\beta^{*}$ and $\sigma^2=(\sigma^{*})^2$.
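
The second imputation step for interval respondents and complete refusals can be sketched with inverse-CDF sampling on the log scale. The function name and parameter values are hypothetical, and the clamping of the uniform draw is only a numerical safeguard we add for this illustration; other truncated normal samplers would work equally well.

```python
import math
import random
from statistics import NormalDist


def impute_interval(lower, upper, mu, sigma, rng):
    """Draw an income whose log is N(mu, sigma^2), truncated to [lower, upper].

    lower <= 0 and upper == inf give the untruncated draw used for refusals.
    """
    dist = NormalDist(mu, sigma)
    p_lo = 0.0 if lower <= 0 else dist.cdf(math.log(lower))
    p_hi = 1.0 if upper == math.inf else dist.cdf(math.log(upper))
    u = rng.uniform(p_lo, p_hi)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return math.exp(dist.inv_cdf(u))


rng = random.Random(0)
# Hypothetical drawn parameters: x_i'beta* = log(1400), sigma* = 0.6.
draws = [impute_interval(1000, 1500, math.log(1400), 0.6, rng) for _ in range(200)]
refusal = impute_interval(0, math.inf, math.log(1400), 0.6, rng)
print(min(draws), max(draws), refusal)
```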

\section{Application to the panel study Labor Market and Social Security}\label{PASS_describe}
We illustrate the application of our approach using data from the German panel study ``Labor Market and Social Security''. To enable a comparison of our extended approach with the approach of \citet{dre:kie:2014} that only focuses on rounding, we use the same models for the income and rounding behaviour and also use the poverty rate to evaluate the impact of the adjustments on important measures that are regularly computed from income data. We start with a description of the data and a short summary of the imputation models borrowed from \citet{dre:kie:2014}. The interested reader is referred to this paper for more details.

The panel study ``Labor Market and Social Security'', started in 2006 and conducted yearly ever since, aims at measuring the social effects of labour market reforms. The survey consists of two different samples, each containing roughly 6,000 households. The first sample is drawn from the Federal Employment Agency's register data containing all persons in Germany receiving unemployment benefit for long-term unemployment. The second sample is drawn from the MOSAIC database of housing addresses collected by the commercial data provider, microm. This sample is representative of the resident population in Germany. The stratified sampling design for this sample oversamples low-income households. The major benefit of this combination of two different samples lies in the fact that control groups for the benefit recipients can easily be constructed. The panel contains a large number of socio-demographic characteristics (for example, age, gender, marital status, religion, migration background), employment-related characteristics (for example, status of employment, working hours, income from employment, employment history), benefit-related characteristics (for example, benefit history, amount of benefits, participation in training measures), and subjective indicators (for example, fears and problems, employment orientation,  subjective social position). A detailed description of the survey can be found in \citet{tra:2010}.

To model the true income, we assume a log-normal distribution for income conditional on a set of covariates $X$. Details about the covariates included in the model are contained in Table~\ref{covariates}.
\renewcommand{\baselinestretch}{1}
\begin{table}[t]
\caption{\label{covariates}Covariates included in the income model.}
\centering
\begin{tabular}{ll}
   \hline\noalign{\smallskip}
variable & characteristics\\
   \hline\noalign{\smallskip}
household size 	&5 categories (household sizes$>4$ set to ``5 or more'')\\
deprivation index &	range: 0--21\\
living space 	  &	range: 7--903 square meters\\
type of household &	8 categories\\
amount of debt   &	7 categories\\
income from savings &	yes/no\\
age of respondent  &	range: 15--99\\
amount of savings  &	8 categories (not available for wave 1)\\
unemployment benefits  & 	yes/no\\
%federal state 	&	16 categories\\
%type of the municipality  &	10 categories (not available for wave 1)\\
weight &		range: 24.95--186,000\\

   \hline\noalign{\smallskip}
\end{tabular}
\end{table}
\renewcommand{\baselinestretch}{2}

All variables are standardized, some sparsely populated categories in $X$ are collapsed and influential outliers are removed to ensure convergence of the maximisation procedure (see \citet{dre:kie:2014} for details). For the rounding behaviour, we assume that the tendency to round only depends on the true income.

\subsection{Evaluation of the model assumptions}\label{mod_eval}
Since the proposed rounding adjustment strategy is purely model based, an evaluation of the model assumptions is essential. We follow the approach of \citet{dre:kie:2014} to check whether the model assumptions are reasonable. They suggest using posterior predictive simulations \citep[Chap. 6]{gcrs:2004} for the evaluations since the true income and the rounding behaviour are never observed, which complicates the evaluation.

\subsubsection{The income model}
For the income model evaluation we generate a very large number of imputations for the true income based on the parameters obtained from maximizing the likelihood in (\ref{eq:extLogLi}). The rounding behaviour is completely ignored here, i.e., imputations are generated for all observations based on the marginal income model described in (\ref{logN}). The obtained imputations can be seen as samples from the posterior predictive distribution of the income for each observation according to the model. To evaluate the model fit we can check whether these posterior distributions cover the observed income values from the original data. Of course many of the observed income values are subject to rounding, so we limit the evaluation to those records for which we can be sure that the reported value is only rounded to the nearest euro (i.e., all records for which the reported value is not divisible by 5). If the imputation model is correct, the true (observed) income should be covered in the region between the empirical $\alpha/2$ quantile and the $1-\alpha/2$ quantile of the imputed values with a probability of $1-\alpha$. Thus, as a measure for the model fit we calculate the fraction of unrounded income values from the observed data that are covered by this interval computed from the imputed values and
compare this fraction to the expected coverage rates. Results based on $m=1{,}000$ imputations are presented in Table \ref{inc_model}. The empirical coverages are generally close to the nominal coverages, although they fall somewhat short at the 99\% level (for example, 94.21\% in wave 2), indicating a reasonably good fit for the income model.
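
The coverage check can be sketched as follows. The function names, the simple order-statistic rule for the empirical quantiles, and the toy data are our own illustrative choices; the paper does not state which quantile definition was used.

```python
def empirical_quantile(sorted_values, q):
    """Order-statistic quantile: the value at position floor(q * n)."""
    n = len(sorted_values)
    index = min(n - 1, max(0, int(q * n)))
    return sorted_values[index]


def coverage_rate(observed, imputations, alpha):
    """Percentage of observed values inside the central (1 - alpha) region.

    observed[i] is the unrounded reported income of record i and
    imputations[i] is the list of m imputed incomes for that record.
    """
    covered = 0
    for y, draws in zip(observed, imputations):
        ordered = sorted(draws)
        lower = empirical_quantile(ordered, alpha / 2)
        upper = empirical_quantile(ordered, 1 - alpha / 2)
        covered += lower <= y <= upper
    return 100.0 * covered / len(observed)


# Toy example: four records, each with the same 100 hypothetical imputations.
toy_draws = [list(range(1, 101)) for _ in range(4)]
rate = coverage_rate([50, 50, 200, -5], toy_draws, alpha=0.10)
print(rate)
```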


\renewcommand{\baselinestretch}{1}

\begin{table}[t]
\begin{small}
\begin{center}
\caption{\label{inc_model}Percentage of true income values from the PASS survey that are covered in the defined regions of the posterior distribution of the imputed income values.}
\centering
\begin{tabular}{ccccccc}
   \hline\noalign{\smallskip}
   Expected     & \multicolumn{6}{c}{Empirical Coverage (in \%)}\\
   \hline\noalign{\smallskip}
   Cov. (in \%)                   & wave 1            & wave 2                          &wave 3          &wave 4 & wave 5& wave 6\\
   \hline\noalign{\smallskip}
99.00    &97.98              &94.21 &97.22  &96.95               &95.41  &97.08\\
95.00    &95.37              &92.23 &93.53  &93.69               &92.96  &93.57\\
90.00    &92.37              &89.21 &90.14  &89.26               &89.33  &89.37\\
%75.00    &81.94              &81.06 &79.88  &79.82               &78.70  &80.58\\
   \hline\noalign{\smallskip}
\end{tabular}
\end{center}
\end{small}
\end{table}
\renewcommand{\baselinestretch}{2}
\subsubsection{The rounding behaviour model}
To evaluate the quality of the rounding behaviour model, we repeatedly re-round the imputed (unrounded) income variable and compare it to the originally observed data. Specifically, we repeatedly ($m=100$) generate unrounded income data that are consistent with the original data according to the joint model for income and rounding behaviour. Then, we repeatedly round each of the obtained exact income variables (100 times for each of the generated income variables) according to the rounding probabilities based on the parameters from the rounding behaviour model. Since we have no direct measure for the rounding behaviour we use a proxy for the evaluation. We compare the share of the income values that are divisible by values that are typically used as rounding bases. Table \ref{round_frac} lists these shares for the original data, the re-rounded data (computed as the average across the 10,000 generated datasets) and the unrounded data (computed as the average across the $m=100$ replicates). Each column reports the percentage of records for which the given number represents the maximum possible rounding base, i.e., these records would not be divisible by any of the larger rounding bases listed in the table. The results are pooled across all waves of the PASS data for readability. Similar results were obtained when looking at each wave individually.
\renewcommand{\baselinestretch}{1}

\begin{table}[t]
%\begin{small}
\caption{\label{round_frac}Percentage of income values that are divisible by a given round
number (but not by any of the larger numbers) in the observed PASS data, the unrounded data, and the re-rounded data.}
\centering
\begin{tabular}{lrrrrrr}
   \hline\noalign{\smallskip}
Income divisible by	  &   5 &	10&	  50 &	 100 &	 500 &	1,000\\
Observed income (\%)  &4.16 &11.89& 7.90 & 36.98 & 9.98 & 13.55\\
Unrounded income (\%) &9.95 & 7.86& 1.02 &  0.80 &  0.09 & 0.10\\
Re-rounded income (\%)&3.14 &12.58& 10.07 & 45.92 &  8.69 & 9.47\\
   \hline\noalign{\smallskip}
\end{tabular}
%\end{small}
\end{table}
\renewcommand{\baselinestretch}{2}
As expected, the percentages differ substantially between the observed income and the unrounded income. Most of the values ($80.18\%$) in the unrounded data (second row in the table) are not divisible by any of the numbers and the percentages decrease quickly as the rounding base increases (note that we assume that values in the unrounded data are always rounded to the nearest euro). This is different for the observed data (first row). Only $15.52\%$ of the data are not divisible by any of the given numbers and $36.98\%$ of the records have a maximum rounding base of 100. The divisibility of the re-rounded data (third row) is reasonably close to the observed data. Again, most records are in the category with a maximum rounding base of 100, although the percentage of records that fall into this category is somewhat overestimated ($45.92\%$ compared to $36.98\%$). This overestimation leads to an underestimation of the percentage of records that are not divisible by any of the numbers ($10.11\%$). For the remaining categories the percentages based on the re-rounded data are fairly close to the percentages based on the observed data, indicating a good fit of the rounding behaviour model.
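
The classification behind Table \ref{round_frac} assigns each value to its maximum rounding base. The sketch below illustrates this classification on hypothetical values and recomputes the share of unrounded values that are not divisible by any base from the second row of the table. The function name is hypothetical.

```python
BASES = (5, 10, 50, 100, 500, 1000)


def max_rounding_base(value, bases=BASES):
    """Largest base dividing the value, or None if no base divides it."""
    divisors = [b for b in bases if value % b == 0]
    return max(divisors) if divisors else None


# Hypothetical income values.
for v in (850, 1437, 3000, 1500, 1200):
    print(v, max_rounding_base(v))

# Share of unrounded values without any rounding base, from the table row.
unrounded_row = [9.95, 7.86, 1.02, 0.80, 0.09, 0.10]
not_divisible = 100.0 - sum(unrounded_row)
print(round(not_divisible, 2))
```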

\subsection{Results}
We compare three different approaches to estimate the poverty rates from the six waves of the PASS survey that are available so far. In the first approach we treat the reported income as the true income and only use the information from those respondents that answered the exact income question. This approach assumes that the reported income is never rounded and implies that the respondents to the exact income question are not systematically different regarding their income from those that only provide income intervals or completely refuse to provide any information regarding their income, i.e., the income information is missing completely at random (MCAR) in the terminology of \citet{rub:1976}. In the second approach we use the methodology of \citet{dre:kie:2014} to account for the rounding but still only use the data from respondents who provided an answer to the exact income question, i.e., we still assume MCAR. The final approach is the extended approach described in this paper which also takes the information from the interval respondents into account and imputes the missing income information for those survey participants that completely refused to provide any information regarding their income. We note that this approach uses more information to estimate the parameters in the imputation model and only assumes that the income information is missing at random (MAR), i.e., the missingness can be explained by the covariates included in the imputation model.

We apply the models described above separately for each year (the variable \textit{amount of savings} is not available in the first wave of the survey and is thus excluded from the income model in that year). Observations with missing data in any of the covariates included in the model are deleted for simplicity. We do not expect this to affect the findings, since missing rates in the covariates are low and all three approaches are affected similarly. Incorporating the imputation routine for the true income into a sequential regression multivariate imputation (SRMI, \citet{rag:2001}) procedure to impute missing values in all variables sequentially would be straightforward.

Table~\ref{results} presents the poverty rates for the different waves. The estimated poverty rate is based on the disposable income, i.e., the reported income is adjusted for the number and the age of the household members as suggested by the OECD (see, for example, \citet{euro:2013}). The first column contains the results based on the original rounded data without any adjustments. The second column contains the results for the multiply imputed true income accounting for rounding. The third column contains the results based on all data, i.e., additionally accounting for nonresponse. All imputation results are based on $m=25$ imputations. The 95\% confidence intervals reported in brackets are based on bootstrap variance estimates.
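
As a sketch of the computation behind these estimates: the formulas are not spelled out here, so the scale weights and the poverty threshold below are assumptions that follow the common Eurostat convention, and survey weights are ignored for simplicity. The modified OECD scale assigns a weight of $1$ to the first adult, $0.5$ to each further household member aged 14 or older, and $0.3$ to each child under 14. With $a_i \geq 1$ members aged 14 or older and $c_i$ children under 14, the equivalised income of household $i$ with income $y_i$ is
\begin{equation*}
y_i^{\mathrm{eq}} = \frac{y_i}{1 + 0.5\,(a_i - 1) + 0.3\,c_i},
\end{equation*}
so that a household with two adults, one child under 14 and an income of $2700$ euros has an equivalised income of $2700/1.8 = 1500$ euros. Under the assumed convention, the poverty rate $\hat{p}$ is the share of persons with $y_i^{\mathrm{eq}}$ below $60\%$ of the median equivalised income. Given $m$ imputed data sets with estimates $\hat{p}^{(1)}, \ldots, \hat{p}^{(m)}$, the multiple imputation point estimate is their average,
\begin{equation*}
\hat{p}_{\mathrm{MI}} = \frac{1}{m} \sum_{l=1}^{m} \hat{p}^{(l)}, \qquad m = 25.
\end{equation*}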

\renewcommand{\baselinestretch}{1}
\begin{table}[t]
\caption{\label{results}Estimated poverty rates (in \%) from the PASS survey (with 95\% confidence intervals reported in brackets).}
\centering
\begin{tabular}{lccc}
   \hline\noalign{\smallskip}
Wave & Original data & Rounding adjustment&Nonresponse and\\
&&&rounding adjustment\\
   \hline\noalign{\smallskip}
Wave 1&17.29        &16.35        &16.50\\
      &(15.81;18.77)&(15.14;17.55)&(15.29;17.71)\vspace{0.2cm}\\
Wave 2&16.91        &16.98        &17.05\\
      &(15.79;18.03)&(15.69;18.27)&(15.78;18.39)\vspace{0.2cm}\\
Wave 3&14.27        &15.40        &15.69\\
      &(12.28;16.27)&(13.91;16.90)&(14.30;16.08)\vspace{0.2cm}\\
Wave 4&14.89        &14.61        &14.51\\
      &(13.44;16.35)&(13.40;15.81)&(13.11;15.92)\vspace{0.2cm}\\
Wave 5&16.34        &15.75        &15.85\\
      &(14.81;17.87)&(14.41;17.10)&(14.46;17.25)\vspace{0.2cm}\\
Wave 6&15.95        &16.27        &16.33\\
      &(14.49;17.42)&(14.81;17.72)&(14.96;17.71)\\
   \hline\noalign{\smallskip}
\end{tabular}
\end{table}
\renewcommand{\baselinestretch}{2}

In most of the years the impact of rounding is much stronger than the impact of (partial) nonresponse. While the differences between the unadjusted poverty rates and the estimates that account for rounding (column one compared to column two) range from $-1.13$ to $+0.94$ percentage points, the differences between the rounding-adjusted estimates and the estimates that also account for nonresponse (column two compared to column three) only range from $-0.29$ to $+0.10$ percentage points. The only exception is the second wave, in which the poverty rate hardly changes between the na\"ive direct estimate and the rounding-adjusted estimate. The smaller impact of the nonresponse is to be expected given that only approximately 5\% of the records are imputed to adjust for nonresponse, compared to approximately 85\% of the records that are imputed for rounding adjustments. Still, the differences in the poverty rates, albeit small, indicate that income is not missing completely at random and that ignoring the nonresponse results in biased inferences.
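
The extreme differences quoted above can be traced back to Table~\ref{results}. Writing $\hat{p}_1$, $\hat{p}_2$ and $\hat{p}_3$ for the estimates in columns one to three (labels introduced here only for illustration), they are
\begin{align*}
\text{Wave 3:}\quad \hat{p}_1 - \hat{p}_2 &= 14.27 - 15.40 = -1.13,\\
\text{Wave 1:}\quad \hat{p}_1 - \hat{p}_2 &= 17.29 - 16.35 = +0.94,\\
\text{Wave 3:}\quad \hat{p}_2 - \hat{p}_3 &= 15.40 - 15.69 = -0.29,\\
\text{Wave 4:}\quad \hat{p}_2 - \hat{p}_3 &= 14.61 - 14.51 = +0.10.
\end{align*}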
%%We also note that the confidence intervals in the third column are shorter than the confidence intervals in the second column in most of %%the cases indicating more efficient estimates. This is because the extended approach presented in this paper fully exploits all the %%information from the partial respondents and also the information contained in the covariates of the nonrespondents which helps to improve %%the inferences regarding the income variable.


\section{Conclusion}
Obtaining reliable income information from surveys is notoriously difficult. Income is considered sensitive information and survey respondents often find it difficult to remember their exact income. In this paper we suggested a strategy to address two common potential sources of bias: nonresponse and rounding. Our multiple imputation approach tackles both problems simultaneously and provides a simple tool to incorporate interval information when making inferences based on the collected data. The application to the Panel Study ``Labor Market and Social Security'' showed that adjusting for these two factors leads to substantial changes in politically important measures such as the poverty rate. We found that rounding has a larger impact on the results than nonresponse, at least for our study, in which the initial nonresponse was relatively low. Of course, the proposed adjustments are based on several assumptions and it is important to critically review these assumptions. First, the correction methods are based on models and the underlying model assumptions need to be evaluated. Alternative models for the income distribution have been suggested in the literature. For example, \citet{gra:2013} suggested modeling the income distribution using the generalized beta distribution of the second kind. However, it is not straightforward to incorporate covariates in this model. Furthermore, we feel that our model evaluations in Section \ref{mod_eval} indicate a good fit of the log-linear model for the conditional income distribution. Second, we assume that the income information is missing at random (MAR), i.e., the nonresponse can be explained by the variables included in the imputation model. This is a crucial assumption in most imputation models and this assumption can never be tested based on the observed data. We believe that the covariates in our model such as age of the respondent, deprivation index, or household size should help to explain the nonresponse in the data.
However, if the MAR assumption does not hold, results from our imputation strategy will be biased and imputation models such as the non-ignorable models proposed by \citet[Chap. 15]{littlerubin} need to be considered. Finally, nonresponse and rounding might not be the only sources of bias in the data. Several studies found that individuals with low earnings tend to overreport their income while individuals with high income tend to underreport their income (see, for example, \citet{pis:1995}). Incorporating this additional measurement error into the adjustment strategy would be an interesting area for future research.


\bibliographystyle{natbib}
\bibliography{ref}



\end{document}
