Providing Data With High Utility And No Disclosure Risk For The Public and Researchers: An Evaluation By Advanced Statistical Disclosure Risk Methods

The demand for data from surveys, registers or other data sets containing sensitive information on people or enterprises has increased significantly over the last years. However, before data are provided to the public or to researchers, confidentiality has to be respected for any data set containing sensitive individual information. Confidentiality can be achieved by applying statistical disclosure control (SDC) methods to the data. Research on SDC methods has become more and more important in recent years because of increased awareness of data privacy and because more and more data are provided to the public or to researchers. However, for legal reasons this is only possible when the released data have (very) low disclosure risk. In this contribution, existing disclosure risk methods are reviewed and summarized. These methods are then applied to a popular real-world data set, the Structural Earnings Survey (SES) of Austria. It is shown that the application of a few selected anonymisation methods leads to well-protected anonymised data with high data utility and low information loss.


Introduction
A microdata file is defined as a data set on the individual level. For each observation a set of variables is typically available. Concerning SDC, these variables can be split into three categories.
• Direct identifiers: Variables that definitely identify a statistical unit. For example, social insurance numbers, names of companies or people, and addresses are considered direct identifiers.
• Key variables: A set of variables that, when considered together, may be used to identify an individual unit. For example, with the combination of gender, age, region and occupation some individuals may be identified. Other examples of (confidential) key variables could be income, health information, nationality or political preferences.
For the description of the methods, it is advantageous to distinguish between categorical and continuous scaled key variables.
• Non-confidential variables: All variables that are not classified in any of the former two groups.
The goal of anonymizing a microdata set is to prevent confidential information from being linked to a specific respondent. The ultimate aim is to release a safe microdata set that has both a low risk of linking confidential information to individual respondents and high data utility.

Measuring disclosure risk
Measuring the risk in a microdata set is of great concern when deciding whether the microdata set is safe to be released. To assess the disclosure risk, realistic assumptions have to be made about the information data users might have at hand to match against the microdata set. These assumptions are called 'disclosure risk scenarios'.
Based on a specific disclosure risk scenario one must define a set of identifying variables (key variables) that can be used as input for the risk evaluation procedure.
Typically, risk evaluation is based on the concept of "rareness/uniqueness" in the sample and/or in the population. The interest is in units/individuals/observations that possess rare combinations of key variables. Those can be assumed to be identified more easily and thus have higher risk. It is possible to cross tabulate all identifying variables and inspect the resulting table.
Patterns with only very few individuals are in this sense considered risky if they also have low sampling weights, i.e. if the expected number of individuals with the same pattern in the population is low.

Frequency counts
Consider a random sample of size $n$ drawn from a finite population of size $N$. Let $\pi_j$, $j = 1, \dots, N$, be the (first order) inclusion probabilities, i.e. the probability that the element $u_j$ of a population of size $N$ is chosen in a sample of size $n$.
All possible combinations of categories in the key variables $X_1, \dots, X_m$ can be calculated by cross tabulation of these categorical variables. Let $f_i$, $i = 1, \dots, n$, be the frequency counts obtained by cross tabulation and let $F_i$ be the frequency counts of the population which belong to the same category. If $f_i = 1$, the corresponding observation is unique in the sample. If $F_i = 1$, the observation is unique in the population. Note that $F_i$ is usually unknown, since information is usually collected on samples and only little information about the population is known from registers and/or external sources.
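The frequency counts described above can be illustrated with a minimal Python sketch. It assumes the microdata sit in a pandas data frame with one column of sampling weights; the function name `frequency_counts`, the column names and the toy data are hypothetical. Estimating the population count of a pattern by the sum of the sampling weights of the observations sharing it is a common simple choice and is an assumption of this sketch, not necessarily the estimator used in this paper.

```python
import pandas as pd


def frequency_counts(df, key_vars, weight_col):
    """Attach sample counts f and weight-based population estimates F_hat.

    f     : number of sample observations sharing the observation's pattern
            in the key variables.
    F_hat : sum of the sampling weights over these observations (assumed
            estimator of the unknown population count F).
    """
    grouped = df.groupby(key_vars)[weight_col]
    out = df.copy()
    out["f"] = grouped.transform("size")
    out["F_hat"] = grouped.transform("sum")
    return out


# Hypothetical toy sample with two categorical key variables.
sample = pd.DataFrame({
    "gender": ["f", "f", "m", "m", "m"],
    "region": ["A", "A", "A", "B", "B"],
    "weight": [10, 20, 5, 50, 50],
})
counts = frequency_counts(sample, ["gender", "region"], "weight")
sample_uniques = counts[counts["f"] == 1]
```

In the toy data only the third observation is a sample unique, and its small weight suggests that the pattern is also rare in the population.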

The k-anonymity concept
Based on a set of key variables, a desired characteristic of a protected microdata set might be to achieve k-anonymity (Samarati and Sweeney 1998; Sweeney 2002). This means that each possible combination of the values of the key variables features at least k units in the microdata, i.e. $f_i \geq k$ for all $i = 1, \dots, n$. k-anonymity is typically provided by recoding categorical key variables and by additionally suppressing specific values in the key variables of individual units.
An extension of k-anonymity is l-diversity (Machanavajjhala, Kifer, Gehrke, and Venkitasubramaniam 2007). Consider a group of observations with the same pattern in the key variables and let the group fulfill k-anonymity. A data intruder can therefore not identify an individual in this group. However, if all observations in the group have the same entry in a sensitive variable (such as cancer in the variable medical diagnosis), then the attack is successful anyway, because the sensitive value is disclosed without any individual being identified.
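Both checks can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used for the SES data: the function names and the toy data are hypothetical, and l-diversity is taken in its simplest "distinct" form, i.e. the number of different sensitive values within each pattern of the key variables.

```python
import pandas as pd


def violates_k_anonymity(df, key_vars, k):
    """Boolean Series: True where the observation's pattern occurs fewer than k times."""
    f = df.groupby(key_vars)[key_vars[0]].transform("size")
    return f < k


def distinct_l_diversity(df, key_vars, sensitive_var):
    """Number of distinct sensitive values within each observation's pattern."""
    return df.groupby(key_vars)[sensitive_var].transform("nunique")


# Hypothetical toy data: the first group is 3-anonymous but not diverse.
records = pd.DataFrame({
    "age": ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "region": ["A", "A", "A", "B", "B"],
    "diagnosis": ["cancer", "cancer", "cancer", "flu", "asthma"],
})
violations = violates_k_anonymity(records, ["age", "region"], k=3)
diversity = distinct_l_diversity(records, ["age", "region"], "diagnosis")
```

The first three observations satisfy 3-anonymity but have a diversity of 1, which is exactly the situation described above: the diagnosis is revealed although nobody is identified.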

Considering sample frequencies on subsets: SUDA2
SUDA (Special Uniques Detection Algorithm) estimates a disclosure risk for each individual. SUDA2 (see, e.g., Manning, Haglin, and Keane 2008) is a recursive algorithm for finding Minimal Sample Uniques. The algorithm generates all possible variable subsets of the defined categorical key variables and scans them for unique patterns in the subsets of variables. The risk of an observation then depends on two aspects.
(a) The lower the number of variables needed to obtain uniqueness, the higher the risk (and the higher the suda score) of the corresponding observation.
(b) The larger the number of minimal sample uniques contained within an observation, the higher the risk of the observation.
The contribution of (a) is calculated for each observation i as a score $l_i$ that depends on m, the depth (the maximum size of variable subsets of the key variables), on $MSUmin_i$, the number of minimal uniques of observation i, and on n, the number of observations of the data set. Since each observation is treated independently, the $l_i$ that belong to one pattern are summed up to give a common suda score for each of the observations belonging to this pattern (this summation is the contribution of (b)).
To obtain the final SUDA score, the suda scores are normalized by dividing by p!, with p being the number of key variables. The so-called DIS suda score is then calculated from the suda and the so-called DIS scores (we refer to Elliot 2000, for details). SUDA2 does not consider sampling weights, and biased estimates may therefore result.
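To make the notion of a minimal sample unique concrete, the following Python sketch enumerates all variable subsets by brute force. It is not the recursive SUDA2 algorithm and it computes no suda or DIS scores; it only lists, for each observation, the minimal subsets of key variables on which the observation is unique. The function name, the `max_size` parameter (playing the role of the depth) and the toy data are hypothetical.

```python
from itertools import combinations

import pandas as pd


def minimal_sample_uniques(df, key_vars, max_size=None):
    """Map each row index to its minimal sample uniques (tuples of variable names)."""
    max_size = max_size or len(key_vars)
    msus = {idx: [] for idx in df.index}
    # Subsets are visited by increasing size, so a unique subset is minimal
    # exactly when it contains no previously found (smaller) unique subset.
    for size in range(1, max_size + 1):
        for subset in combinations(key_vars, size):
            counts = df.groupby(list(subset))[subset[0]].transform("size")
            for idx in df.index[(counts == 1).to_numpy()]:
                if not any(set(found) <= set(subset) for found in msus[idx]):
                    msus[idx].append(subset)
    return msus


# Hypothetical toy data with three categorical key variables.
people = pd.DataFrame({
    "gender": ["f", "f", "m", "m"],
    "age": ["young", "old", "old", "old"],
    "region": ["A", "A", "A", "B"],
})
msus = minimal_sample_uniques(people, ["gender", "age", "region"])
```

In the toy data the first and last observations are unique on a single variable, whereas the other two need two variables, so by aspect (a) the former would receive the higher scores. The brute-force enumeration grows exponentially with the number of key variables, which is the cost the recursive search of SUDA2 is designed to reduce.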

Considering population frequencies -the individual risk
To decide whether an individual unit is at risk, typically a threshold approach is used. If the individual risk of re-identification for an individual is above a certain threshold value, the unit is said to be at risk. To calculate the individual risks it is necessary to estimate the frequency of a given key in the population. In the previous section, Section 2.1, the population frequencies have already been estimated. However, one can show that these estimates almost always overestimate small population frequency counts (details can be found in Templ and Meindl 2010) and should not be used to estimate the disclosure risk.
A better approach is to use so-called super-population models, in which population frequency counts are modeled given a certain distribution. The whole estimation procedure of sample counts given the population counts can be modeled, for example, by using a Negative Binomial distribution (see, e.g., Rinott and Shlomo 2006). Explaining the final measurement of the individual risk is beyond the scope of this contribution, but it can be found in Franconi and Polettini (2004) and Templ and Meindl (2010).

Measuring the global risk
Although the individual risks have to be respected, since a data intruder should not be able to identify individuals, a measure of the global risk is often also estimated to express the risk of the whole data set in one number.

Measuring the global risk based on the individual risks
The first approach is to determine a threshold for the individual risk and to calculate the percentage of individuals that have larger individual risk than this threshold.
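This threshold-based global measure amounts to a one-line computation, sketched here in Python. The function name and the risk values are hypothetical; the individual risks are assumed to have been estimated beforehand.

```python
import numpy as np


def share_at_risk(individual_risk, threshold):
    """Percentage of observations whose individual risk exceeds the threshold."""
    risk = np.asarray(individual_risk, dtype=float)
    return 100.0 * np.mean(risk > threshold)


# Hypothetical individual risks for four observations.
global_risk = share_at_risk([0.001, 0.02, 0.5, 0.005], threshold=0.01)
```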

Measuring the risk using log-linear models
The sample frequencies $f_m$, $m = 1, \dots, M$, of the M patterns can be modeled by a Poisson distribution, and the global risk may be defined by a measure $\tau_1$ (see Skinner and Holmes 1998). For simplicity, the inclusion probabilities are assumed to be equal, $\pi_m = \pi$, $m = 1, \dots, M$. $\tau_1$ can be estimated by log-linear models including the main effects and possible interactions.
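The following Python sketch illustrates one way such model-based global risk measures can be computed. Everything in it is an assumption made for illustration and may differ from the definition used in this paper. It assumes population counts $F_m \sim \text{Poisson}(\lambda_m)$ and sampling with a common inclusion probability $\pi$, so that the number of unsampled units in pattern $m$ is Poisson with mean $\theta_m = \mu_m (1 - \pi)/\pi$, where $\mu_m = \pi \lambda_m$ is the expected sample count. For a sample unique, the probability of being a population unique is then $\exp(-\theta_m)$ and the expected value of $1/F_m$ is $(1 - \exp(-\theta_m))/\theta_m$; summing these over the sample uniques gives the two measures called `tau1` and `tau2` below. The expected sample counts are fitted by the main-effects (independence) log-linear model of a two-way table, whose fitted values have a closed form. All names and the toy table are hypothetical.

```python
import numpy as np


def global_risk_main_effects(table, pi):
    """Model-based global risk for a two-way table of sample counts.

    table : 2-D array of sample frequency counts, one cell per pattern.
    pi    : common inclusion probability, strictly between 0 and 1.
    Returns (tau1, tau2): expected number of sample uniques that are
    population unique, and expected sum of 1/F over the sample uniques.
    """
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Fitted expected counts of the main-effects log-linear model.
    mu = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    # Poisson mean of the unsampled units in each pattern.
    theta = mu * (1.0 - pi) / pi
    unique = table == 1
    tau1 = float(np.exp(-theta[unique]).sum())
    tau2 = float(((1.0 - np.exp(-theta[unique])) / theta[unique]).sum())
    return tau1, tau2


# Hypothetical 2 x 2 table with two sample uniques, sampled with pi = 0.5.
tau1, tau2 = global_risk_main_effects([[1, 0], [0, 1]], pi=0.5)
```

Adding interaction terms, as mentioned above, would require an iterative fit of the log-linear model in place of the closed-form expected counts.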

Measuring risk for continuous key variables
Applying the concepts of uniqueness and k-anonymity to quantitative variables results in every observation in the data set being unique. Hence, this approach fails for continuous key variables.
If detailed information about a value of a continuous scaled variable is available, one may be able to identify (by linking information) and eventually gain further information about an individual. For continuous key variables it is assumed that an intruder has information about a statistical unit

Distance-based record linkage
Distance-based record linkage methods aim to find the nearest neighbors between observations from two data sets. Domingo-Ferrer and Torra (2001) have shown that these methods outperform probabilistic methods. Generally, it is evaluated whether the original value falls within an interval centered on the masked value. Such an interval might be based on the standard deviation of the variable (see also Mateo-Sanz, Sebe, and Domingo-Ferrer 2004).
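Both ideas can be sketched in Python as follows. The details are assumptions made for illustration: distances are Euclidean on variables scaled by the standard deviation of the original data, a record counts as disclosed in the interval check only if all of its original values fall inside the intervals, and the half-width of an interval is `k` times the standard deviation. The function names, the default value of `k` and the toy data are hypothetical.

```python
import numpy as np


def linkage_rate(original, masked):
    """Percentage of masked records whose nearest original record is their own source."""
    x = np.asarray(original, dtype=float)
    xm = np.asarray(masked, dtype=float)
    sd = x.std(axis=0, ddof=1)
    a, b = x / sd, xm / sd
    # Squared distances between every masked and every original record
    # (quadratic memory use, so suitable for small data only).
    dist = ((b[:, None, :] - a[None, :, :]) ** 2).sum(axis=2)
    nearest = dist.argmin(axis=1)
    return 100.0 * np.mean(nearest == np.arange(len(x)))


def interval_disclosure_risk(original, masked, k=0.05):
    """Percentage of records whose original values all lie within masked +/- k * sd."""
    x = np.asarray(original, dtype=float)
    xm = np.asarray(masked, dtype=float)
    half_width = k * x.std(axis=0, ddof=1)
    inside = np.abs(x - xm) <= half_width
    return 100.0 * np.mean(inside.all(axis=1))


# Hypothetical toy data: the last record is masked strongly, the others hardly.
orig = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
mask = orig + np.array([[0.01, 0.1], [-0.01, -0.1], [0.01, 0.1], [-1.9, -19.0]])
linked = linkage_rate(orig, mask)
at_risk = interval_disclosure_risk(orig, mask)
```

In the toy data three of the four masked records stay within the intervals and are linked back to their source, so both measures report 75 %.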
Almost all data sets from Official Statistics consist of statistical units whose values in at least one variable are quite different from the main part of the observations. As a consequence, these variables are very asymmetrically distributed. Such outliers might be enterprises with a very large value for turnover, for example, or persons with extremely high income, or even multivariate outliers. Other disclosure risk methods that are not used in this contribution take the "outlyingness" of an observation into account (for details, see Templ and Meindl 2008).

Application to the statistics on earnings survey
The Structural Earnings Survey (SES) is conducted in almost all European countries and includes variables on the earnings of employees and other variables on the employee and employment level (e.g. region, size of the enterprise, economic activities of the enterprise, gender and age of the employees, ...).
Generally, such linked employer-employee data are used to identify the determinants/differentials of earnings, but some indicators, like the gender pay gap or the Gini coefficient (Gini 1912), are also derived directly from the hourly earnings. The most classical example is the income inequality between genders as discussed in Groshen (1991), for example.
A correct identification of factors influencing the earnings could lead to relevant evidence-based policy decisions. The research studies are usually focused on examining the determinants of disparities in earnings.
The Austrian SES 2006 survey data consist of 199,909 observations obtained from a two-stage design: in the first stage, the enterprises are chosen with certain inclusion probabilities depending on the enterprise size and location; in the second stage, employees in the selected enterprises are chosen with different inclusion probabilities (for more information, see Geissberger 2009).

Disclosure risk and information loss for SES
The following variables are chosen as key variables:
• Categorical key variables: size of enterprise (5 ordered categories), age (66 ordered categories), location (3 categories), economic activity (53 categories)
• Continuous key variables: hourly earnings, earnings
Among the methods compared in Table 1, sh denotes shuffling (Muralidhar and Sarathy 2006).
The rows of Table 1 correspond to the following disclosure risk and information loss measures:
• R:2-a (R:3-a): percent of observations violating 2(3)-anonymity
• R:ind: percent of observations with individual risk below 0.01
• R:suda: percent of observations having suda dis score lower than 0.1
• R:glob1, R:glob2: global risks from log-linear models
• R:num: distance-based disclosure risk
• IL1: information loss IL1
• IL:eig: information loss based on differences in the eigenvalues
• IL:lm: model-based estimation information loss
The mentioned measures of information loss are briefly explained in the following.
IL1: The scaled distances between original and perturbed values for all p continuous key variables.
IL:eig: The relative absolute differences between the eigenvalues of the covariance matrices of the standardized continuous key variables of the original and the perturbed data.
Table 1 leads to the following interpretation. The original unmodified SES data contain about 5.35 % of observations that violate 3-anonymity and about 2.48 % of risky observations (using the individual risk approach). For the original data, the global model-based risk is 0.83 (and 1.35), which is quite similar to the percentage of observations having a high dis suda score (0.87). Of course, the risk on the continuous key variables is 100 % and the information loss on these variables is zero. When recoding economic activity into fewer categories, the risk is reduced by a factor of almost 5. When additionally recoding the variable age, the risk is reduced dramatically. After additionally applying local suppression, the risk is zero for all risk measures except the individual risk.
The risk on continuous variables is evaluated for each method independently. It is very low when additive noise is added to the data, but at the same time the information loss is unacceptably large. The information loss is very small when correlated noise is added, but the risk is still high. For microaggregation, the information loss is (almost) zero, but the reported risk is high. However, since three observations are always aggregated, the anonymisation might be fine; the disclosure risk measure is simply not suitable for microaggregation. The performance of shuffling is good, but the model-based estimates differ by more than 8 % after shuffling the data.
Probably the most interesting information loss measure, the one based on fitting a linear model to the data (IL:lm), reports that the information loss is very low except for additive noise and shuffling.
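The two information loss measures IL1 and IL:eig described above can be sketched in Python as follows. The exact scaling is an assumption of this sketch: IL1 is taken as the sum of absolute deviations divided by $\sqrt{2}$ times the standard deviation of the original variable, and for IL:eig both data sets are standardized with the mean and standard deviation of the original data. The function names and the toy data are hypothetical.

```python
import numpy as np


def il1(original, perturbed):
    """Sum of absolute deviations, scaled by sqrt(2) times the original standard deviation."""
    x = np.asarray(original, dtype=float)
    xp = np.asarray(perturbed, dtype=float)
    sd = x.std(axis=0, ddof=1)
    return float(np.sum(np.abs(x - xp) / (np.sqrt(2.0) * sd)))


def il_eig(original, perturbed):
    """Sum of relative absolute differences between covariance eigenvalues."""
    x = np.asarray(original, dtype=float)
    xp = np.asarray(perturbed, dtype=float)
    center, scale = x.mean(axis=0), x.std(axis=0, ddof=1)
    cov_o = np.atleast_2d(np.cov((x - center) / scale, rowvar=False))
    cov_p = np.atleast_2d(np.cov((xp - center) / scale, rowvar=False))
    # eigvalsh returns the eigenvalues in ascending order for both matrices;
    # the measure is undefined if an original eigenvalue is zero.
    e_o = np.linalg.eigvalsh(cov_o)
    e_p = np.linalg.eigvalsh(cov_p)
    return float(np.sum(np.abs(e_o - e_p) / e_o))


# Hypothetical toy data: one value of the first variable is perturbed by 2.
x_orig = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]])
x_pert = np.array([[3.0, 2.0], [3.0, 6.0], [5.0, 4.0]])
loss_il1 = il1(x_orig, x_pert)
loss_eig = il_eig(x_orig, x_pert)
```

Both measures are zero when the perturbed data equal the original data, which matches the zero information loss reported for the unmodified SES data.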

Conclusion
In this contribution, popular disclosure risk methods have been summarized. We stressed that the disclosure risk should be measured after the application of any SDC method to the data. Because of the page limit we focused only briefly on measuring the data utility and information loss, but it should be clear that the aim is to provide a data set with both low disclosure risk and high data utility.
In the practical example, a very popular data set was used and the disclosure risk and data utility/information loss were evaluated. Hereby, the whole range of disclosure risk methods has been applied.
IL:lm: $|(\bar{y}_w^{o} - \bar{y}_w^{p}) / \bar{y}_w^{o}|$, with $\bar{y}_w$ the (Horvitz-Thompson) weighted mean of the exponentials of the fitted values from the model log(earningsHour) ∼ age + Location + Sex + education + Occupation + economicActivity + Length + Size (using weighted least squares estimation considering the sampling weights) obtained from the original (index o) and the perturbed data (index p).
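The IL:lm measure can be sketched in Python with plain NumPy. This is a simplified illustration under stated assumptions: the predictors are passed as a numeric design matrix (categorical variables such as Location or Sex would first have to be dummy coded), the weighted mean is computed as the weighted sum divided by the sum of the weights, and all function names and toy data are hypothetical.

```python
import numpy as np


def weighted_mean_of_fitted(X, y, w):
    """Weighted mean of exp(fitted values) from a weighted least squares fit of log(y) on X."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    sqrt_w = np.sqrt(w)
    # Weighted least squares via ordinary least squares on rescaled rows.
    beta, *_ = np.linalg.lstsq(X * sqrt_w[:, None], np.log(y) * sqrt_w, rcond=None)
    fitted = X @ beta
    return float(np.sum(w * np.exp(fitted)) / np.sum(w))


def il_lm(X_orig, y_orig, X_pert, y_pert, w):
    """Relative absolute difference of the weighted means (original vs. perturbed)."""
    m_o = weighted_mean_of_fitted(X_orig, y_orig, w)
    m_p = weighted_mean_of_fitted(X_pert, y_pert, w)
    return abs((m_o - m_p) / m_o)


# Hypothetical toy data: perturbation inflates all earnings by 10 %.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
earnings = np.exp(1.0 + 0.5 * x[:, 0])
weights = np.array([1.0, 2.0, 3.0, 4.0])
loss_lm = il_lm(x, earnings, x, 1.1 * earnings, weights)
```

Because the toy earnings follow the log-linear model exactly, a uniform inflation of 10 % yields an information loss of 0.1.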

Table 1: Resulting disclosure risk and information loss of the SES data.
The columns in Table 1 correspond to the following data:
• add: additive noise (noise parameter equals 10, see Templ, Kowarik, and Meindl (2013))
• corr: correlated noise (defaults of Templ et al. (2013))