String Matching Techniques: An Empirical Assessment Based on Statistics Austria's Business Register

Michaela Denk; Peter Hackl; Norbert Rainer

doi:10.17713/ajs.v34i3.415

Authors

Michaela Denk, ec3 Electronic Commerce Competence Center, Vienna
Peter Hackl, University of Economics and Business Administration, Vienna
Norbert Rainer, Statistics Austria, Vienna

DOI:

https://doi.org/10.17713/ajs.v34i3.415

Abstract

The maintenance and updating of Statistics Austria's business register requires regular matching of the register against other data sources, one of which is the register of tax units of the Austrian Federal Ministry of Finance. The matching process is based on string comparison via bigrams of enterprise names and addresses, and on a quality class approach that assigns pairs of register units to classes of different compliance (i.e., matching quality) based on bigram similarity values and the comparison of other matching variables, such as the NACE code or the year of foundation.
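To make the idea of bigram-based string comparison concrete, the following is a minimal sketch in Python. It assumes the Dice coefficient on bigram multisets as the similarity value and a simple lowercase and whitespace normalization; the abstract does not state which bigram formula or preprocessing the register matching uses, so the function names `bigrams` and `bigram_similarity` and these choices are illustrative assumptions only.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, minimal preprocessing)."""
    return " ".join(text.lower().split())


def bigrams(text: str) -> Counter:
    """Multiset of adjacent character pairs of the normalized string."""
    s = normalize(text)
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient on bigram multisets (assumed formula).

    Returns a value in [0, 1]: 1.0 when both strings have the same
    bigrams, 0.0 when they share none.
    """
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Both strings are shorter than two characters: fall back to equality.
        return 1.0 if normalize(a) == normalize(b) else 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2.0 * shared / total


if __name__ == "__main__":
    print(bigram_similarity("night", "nacht"))  # one shared bigram out of eight
```

In a quality class approach, a value like this would be combined with agreement on other matching variables to grade a candidate pair; the thresholds used for that grading are not given in the abstract and are therefore not sketched here.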

Based on methodological research concerning matching techniques carried out in the DIECOFIS project, an empirical comparison of the bigram method and other string matching techniques was conducted. The edit distance, the Jaro and Jaro-Winkler algorithms, the longest common subsequence, and the maximal match were selected as appropriate alternatives and evaluated in the study.
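For orientation, four of the alternative techniques named above can be sketched in Python using their common textbook definitions. These are assumptions about the general form of each measure, not the variants or parameter settings evaluated in the study; in particular, the Jaro-Winkler prefix weight `p = 0.1` and the prefix cap of four characters are the conventional defaults, and the maximal match is omitted because its definition is not given here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        # Look for an unused equal character within the matching window.
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (p and max_prefix are assumed defaults)."""
    sim = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(lcs_length("ABCBDAB", "BDCABA"))
    print(round(jaro("MARTHA", "MARHTA"), 4))
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Note that the edit distance is a dissimilarity while the other three grow with similarity, so a comparison across measures typically rescales them to a common range, for example by dividing by the length of the longer string.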

This paper briefly introduces Statistics Austria's business register and the corresponding maintenance process and reports the results of the empirical study.
