ANU Computer Science Technical Reports
TR-CS-07-03
Peter Christen.
Towards Parameter-free Blocking for Scalable Record
Linkage.
August 2007.
Abstract: Linking or matching databases is becoming
increasingly important in many data mining projects, as linked data can
contain information that is not available otherwise, or that would be too
expensive to collect. A main challenge when linking large databases is the
complexity of the linkage process: potentially each record in one database
has to be compared with all records in the other database. Various
techniques, collectively known as `blocking', have been developed to deal with
this quadratic complexity. Most of these techniques require several
parameters to be set by the user in order to achieve good results. In this
paper we evaluate six blocking techniques within a common framework with
regard to the number and quality of the candidate record pairs generated. We
propose a modification to two existing techniques that reduces the variance
in the quality of the blocking results over a range of parameter values,
enabling more robust, practical record linkage without the need for
time-consuming manual parameter tuning.
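
As a rough illustration of the idea behind blocking, the following minimal sketch compares only records that share a blocking key value instead of forming every cross-database pair. It is not taken from the report: the function names, the sample records, and the choice of key (the first three letters of the surname) are all assumptions made for this example, and the report's six techniques are not reproduced here.

```python
from collections import defaultdict


def blocking_key(record):
    # Hypothetical key for illustration: first three letters of the surname.
    return record["surname"][:3].lower()


def candidate_pairs(db_a, db_b, key=blocking_key):
    """Return (id_a, id_b) pairs whose records share a blocking key value."""
    # Index the second database by blocking key value.
    blocks_b = defaultdict(list)
    for rec in db_b:
        blocks_b[key(rec)].append(rec["id"])

    # Pair each record in the first database only with records in its block.
    pairs = []
    for rec in db_a:
        for id_b in blocks_b.get(key(rec), []):
            pairs.append((rec["id"], id_b))
    return pairs


# Made-up sample data, not from the report.
db_a = [
    {"id": "a1", "surname": "Smith"},
    {"id": "a2", "surname": "Smyth"},
    {"id": "a3", "surname": "Miller"},
]
db_b = [
    {"id": "b1", "surname": "Smith"},
    {"id": "b2", "surname": "Millar"},
    {"id": "b3", "surname": "Jones"},
]

all_pairs = len(db_a) * len(db_b)      # full comparison: |A| * |B| pairs
blocked = candidate_pairs(db_a, db_b)  # comparison restricted to blocks

print(all_pairs, len(blocked), blocked)
```

With this sample data the full comparison has 9 pairs, while blocking keeps 2. The pair of `Smyth` and `Smith` is dropped because the two keys differ, which shows why both the number and the quality of the candidate pairs depend on how the key and other parameters are chosen.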