ANU Computer Science Technical Reports

TR-CS-07-03


Peter Christen.
Towards Parameter-free Blocking for Scalable Record Linkage.
August 2007.



Abstract: Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise, or that would be too expensive to collect. A main challenge when linking large databases is the complexity of the linkage process: potentially each record in one database has to be compared with all records in the other database. Various techniques, collectively known as 'blocking', have been developed to deal with this quadratic complexity. Most of these techniques require several parameters to be set by the user in order to achieve good results. In this paper we evaluate six blocking techniques within a common framework with regard to the number and quality of the candidate record pairs generated. We propose a modification to two existing techniques that reduces the variance in the quality of the blocking results over a range of parameter values, enabling more robust, practical record linkage without the need for time-consuming manual parameter tuning.
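
As a rough illustration of the quadratic complexity mentioned in the abstract, and of how blocking reduces it, the following minimal sketch contrasts a full comparison with a simple key-based blocking. It is not one of the six techniques evaluated in the report, nor the modification it proposes; the record fields (`surname`, `postcode`), the sample records, and the function names are all assumptions made up for this example.

```python
from collections import defaultdict
from itertools import product


def full_comparison_pairs(db_a, db_b):
    """Pair every record in db_a with every record in db_b (|A| * |B| pairs)."""
    return set(product(db_a.keys(), db_b.keys()))


def blocking_key(record):
    # Hypothetical blocking key: first letter of the surname plus the postcode.
    return record["surname"][:1].lower() + record["postcode"]


def blocked_pairs(db_a, db_b, key=blocking_key):
    """Pair only those records that share the same blocking key value."""
    blocks = defaultdict(list)
    for rec_id, rec in db_b.items():
        blocks[key(rec)].append(rec_id)
    pairs = set()
    for rec_id, rec in db_a.items():
        for other_id in blocks.get(key(rec), []):
            pairs.add((rec_id, other_id))
    return pairs


# Made-up sample records, for illustration only.
db_a = {
    "a1": {"surname": "Smith", "postcode": "2600"},
    "a2": {"surname": "Miller", "postcode": "2601"},
    "a3": {"surname": "Jones", "postcode": "2602"},
}
db_b = {
    "b1": {"surname": "Smyth", "postcode": "2600"},
    "b2": {"surname": "Miller", "postcode": "2601"},
    "b3": {"surname": "Johnson", "postcode": "2612"},
}

all_pairs = full_comparison_pairs(db_a, db_b)
candidates = blocked_pairs(db_a, db_b)
print(len(all_pairs), len(candidates))
```

In this sketch the choice of blocking key is itself a user-set parameter: a coarser key yields more candidate pairs, while a finer key risks missing true matches, which is the kind of tuning sensitivity the abstract refers to.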
Last modified: Tue May 31 12:56:01 EST 2011