Data Fusion: Identification Problems, Validity, and Multiple Imputation

Authors

  • Susanne Rässler, Institute for Employment Research (IAB), Nürnberg, Germany

DOI:

https://doi.org/10.17713/ajs.v33i1&2.436

Abstract

Data fusion techniques typically aim to build a complete data file from different sources that do not contain the same units. Traditionally, data fusion, known in the US as statistical matching, is carried out on the basis of variables common to all files. It is well known that these approaches establish conditional independence of the (specific) variables not jointly observed given the common variables, although they may be conditionally dependent in reality. However, if the common variables are (carefully) chosen in a way that already establishes conditional independence, then inference about the actually unobserved association is valid. In terms of regression analysis, this implies that the common variables have high explanatory power for the specific variables. Unfortunately, this assumption is not yet testable. Hence, we structure and discuss the objectives of statistical matching in the light of their feasibility. Four levels of validity that a matching technique may achieve are introduced. Suitable multiple imputation (MI) techniques are used to reflect the identification problem inherent in data fusion. A simulation study also shows that MI makes it easy to use auxiliary information efficiently.
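
The identification problem described in the abstract can be illustrated with a small calculation. The sketch below is not taken from the article. It assumes a single common variable Z and purely linear relationships (for example a trivariate normal model), and the function name `feasible_xy_correlation` is hypothetical. Given the two observable correlations corr(X, Z) and corr(Y, Z), it returns the range of values for the unobserved corr(X, Y) that keep the 3×3 correlation matrix positive semidefinite. The midpoint of that range is the value implied by conditional independence of X and Y given Z.

```python
import math
from typing import Tuple


def feasible_xy_correlation(r_xz: float, r_yz: float) -> Tuple[float, float]:
    """Range of corr(X, Y) compatible with given corr(X, Z) and corr(Y, Z).

    Assumption (illustrative only): one common variable Z and linear
    relationships, so that the only constraint on corr(X, Y) is that the
    3x3 correlation matrix has a non-negative determinant.
    """
    for r in (r_xz, r_yz):
        if not -1.0 <= r <= 1.0:
            raise ValueError("correlations must lie in [-1, 1]")

    # Value of corr(X, Y) under conditional independence given Z
    # (zero partial correlation).
    centre = r_xz * r_yz

    # Half-width of the admissible interval, from
    # 1 + 2*a*b*c - a**2 - b**2 - c**2 >= 0 solved for c.
    half_width = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))

    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Weak common variable: the data say almost nothing about corr(X, Y).
    print(feasible_xy_correlation(0.3, 0.3))
    # Strong common variable: the admissible range becomes much narrower.
    print(feasible_xy_correlation(0.9, 0.9))
```

Under these assumptions the interval shrinks as the common variable explains more of X and Y, which matches the abstract's remark that high explanatory power of the common variables is what makes inference about the unobserved association credible.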

Published

2016-04-03

How to Cite

Rässler, S. (2016). Data Fusion: Identification Problems, Validity, and Multiple Imputation. Austrian Journal of Statistics, 33(1&2), 153–171. https://doi.org/10.17713/ajs.v33i1&2.436

Section

Articles