String Matching Techniques: An Empirical Assessment Based on Statistics Austria's Business Register
DOI: https://doi.org/10.17713/ajs.v34i3.415
Abstract
The maintenance and updating of Statistics Austria's business register requires regular matching of the register against other data sources, one of which is the register of tax units of the Austrian Federal Ministry of Finance. The matching process is based on string comparison of enterprise names and addresses via bigrams, combined with a quality class approach that assigns pairs of register units to classes of different compliance (i.e., matching quality) based on bigram similarity values and the comparison of other matching variables, such as the NACE code or the year of foundation.
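As an illustration of bigram-based string comparison, the following Python sketch scores two names by the Dice coefficient over their character bigrams. The abstract does not state which bigram similarity formula or normalization the register uses, so the Dice coefficient, the upper-casing and whitespace collapsing, and the function names `normalize`, `bigrams` and `bigram_similarity` are all assumptions made for this example.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Upper-case and collapse whitespace (an assumed, minimal normalization)."""
    return " ".join(text.upper().split())


def bigrams(text: str) -> Counter:
    """Multiset of overlapping character bigrams of the normalized text."""
    s = normalize(text)
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient on bigram multisets: 2 * shared / total, in [0, 1]."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Both strings are shorter than two characters: fall back to equality.
        return 1.0 if normalize(a) == normalize(b) else 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2.0 * shared / total


if __name__ == "__main__":
    print(bigram_similarity("Mayer  GmbH", "MAYER GMBH"))
    print(bigram_similarity("NIGHT", "NACHT"))
```

In a quality class approach, a score like this would be combined with agreement on other variables before a pair is assigned to a class; the thresholds used for that step are not given here and are therefore not sketched.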
Based on methodological research on matching techniques carried out in the DIECOFIS project, an empirical comparison of the bigram method with other string matching techniques was conducted. The edit distance, the Jaro algorithm, the Jaro-Winkler algorithm, the longest common subsequence, and the maximal match were selected as appropriate alternatives and evaluated in the study.
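Two of the alternatives named above, the edit distance and the longest common subsequence, can be sketched with textbook dynamic programming. The Python code below is a minimal illustration and not the implementation evaluated in the study; the function names and the normalization of the edit distance to a similarity in [0, 1] are assumptions introduced for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(lcs_length("ABCBDAB", "BDCABA"))
```

Both functions keep only two rows of the dynamic programming table, so memory grows with the length of the second string rather than with the product of both lengths.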
This paper briefly introduces Statistics Austria's business register and the corresponding maintenance process and reports on the results of the empirical study.
License
The Austrian Journal of Statistics publishes open access articles under the terms of the Creative Commons Attribution (CC BY) License.
The Creative Commons Attribution License (CC-BY) allows users to copy, distribute and transmit an article, adapt the article and make commercial use of the article. The CC BY license permits commercial and non-commercial re-use of an open access article, as long as the author is properly attributed.
Copyright on any research article published by the Austrian Journal of Statistics is retained by the author(s). Authors grant the Austrian Journal of Statistics a license to publish the article and identify itself as the original publisher. Authors also grant any third party the right to use the article freely as long as its original authors, citation details and publisher are identified.