Data Integration: Techniques and Evaluation
Abstract
Within the DIECOFIS framework, ec3, the Division of Business Statistics of the Vienna University of Economics and Business Administration, and ISTAT worked together to find methods for creating a comprehensive database of the enterprise data required for taxation microsimulations by integrating existing disparate enterprise data sources. This paper provides an overview of the broad spectrum of methodology investigated (including exact and statistical matching as well as imputation) and of the related statistical quality indicators. It emphasises the relevance of data integration, especially for official statistics, as a means of using available information more efficiently and improving the quality of a statistical agency’s products. Finally, it gives an outlook on an empirical study comparing different exact matching procedures in the maintenance of Statistics Austria’s Business Register.
The Austrian Journal of Statistics publishes open access articles under the terms of the Creative Commons Attribution (CC BY) License.
The CC BY license allows users to copy, distribute, transmit and adapt an article, and permits both commercial and non-commercial re-use of an open access article, as long as the author is properly attributed.
Copyright on any research article published by the Austrian Journal of Statistics is retained by the author(s). Authors grant the Austrian Journal of Statistics a license to publish the article and identify itself as the original publisher. Authors also grant any third party the right to use the article freely as long as its original authors, citation details and publisher are identified.