Data Integration and Record Matching: An Austrian Contribution to Research in Official Statistics
Data integration techniques are one of the core elements of DIECOFIS, an EU-funded international research project that aims to develop a methodology for constructing a system of indicators on competitiveness and fiscal impact on enterprise performance. Data integration is also of major interest to official statistics agencies as a means of using available information more efficiently and improving the quality of the agency’s products. The Austrian member of the project consortium comprises university departments and representatives from the Bundesanstalt Statistik Austria, the Statistical Department of the Austrian Economic Chamber, and ec3, a non-profit research corporation. This paper gives a short report on DIECOFIS in general and on the Austrian contribution to the project, mainly dealing with data integration methodology. Various papers read at the DIECOFIS workshop in Vienna last November will be published as a Special Issue of the Austrian Journal of Statistics.
The Austrian Journal of Statistics publishes open access articles under the terms of the Creative Commons Attribution (CC BY) License.