ANU Computer Science Technical Reports
TR-CS-09-01
Peter Christen, Ross Gayler, and David Hawking.
Similarity-Aware Indexing for Real-Time Entity
Resolution.
August 2009.
Abstract: Entity resolution, also known as data
matching or record linkage, is the task of identifying records from several
databases that refer to the same entities. Traditionally, entity resolution
has been applied to static databases, for example to find records that relate
to the same patient in different health databases. Most research in entity
resolution has concentrated on either improving the matching quality, making
entity resolution scalable to very large databases, or reducing the manual
effort required throughout the resolution process. Increasingly, however,
many organisations face the challenge of matching a stream of query records
against large databases of entity records in real time, such that the
best-matching records are retrieved. Example applications include online
law enforcement and national
security databases, public health surveillance and emergency response
systems, financial verification systems, and online retail stores. In
this paper, a novel inverted-index-based approach for real-time entity
resolution is presented. At build time, similarities between attribute values
are computed and stored to support the fast matching of records at query
time. The presented approach differs from other recently developed approaches
to approximate querying, in that it allows any similarity comparison
function, and any 'blocking' function, both possibly domain-specific, to be
incorporated.
Experimental results on a large real-world database
indicate that the total size of all data structures of this novel index
approach grows sub-linearly with the size of the database, and that it allows
matching of query records in sub-second time, more than two orders of
magnitude faster than a traditional entity resolution index approach.
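
The idea described in the abstract (compute similarities between attribute values at build time, then look them up at query time) can be illustrated with a minimal sketch. This is not the report's implementation: the class name, the three dictionaries, the first-character blocking function, and the use of Python's difflib as the similarity function are all assumptions chosen for illustration. The abstract only states that any similarity function and any blocking function can be plugged in.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(value):
    # Hypothetical blocking function: first character of the value.
    return value[:1].lower()


def similarity(a, b):
    # Stand-in similarity function from the standard library.
    return SequenceMatcher(None, a, b).ratio()


class SimilarityAwareIndex:
    """Illustrative single-attribute index (names are assumptions)."""

    def __init__(self):
        self.records = defaultdict(set)  # value -> record ids
        self.blocks = defaultdict(set)   # block key -> values in block
        self.sims = defaultdict(dict)    # value -> {other value: similarity}

    def add(self, record_id, value):
        if value not in self.records:
            # First time this value is seen: compare it once against
            # the other values in its block and store the results.
            key = block_key(value)
            for other in self.blocks[key]:
                s = similarity(value, other)
                self.sims[value][other] = s
                self.sims[other][value] = s
            self.blocks[key].add(value)
        self.records[value].add(record_id)

    def query(self, value, top=3):
        if value in self.records:
            # Known value: reuse similarities stored at build time.
            candidates = self.sims.get(value, {})
        else:
            # Unseen value: compute similarities within its block only.
            block = self.blocks.get(block_key(value), ())
            candidates = {o: similarity(value, o) for o in block}
        scores = {rid: 1.0 for rid in self.records.get(value, ())}
        for other, s in candidates.items():
            for rid in self.records[other]:
                scores[rid] = max(scores.get(rid, 0.0), s)
        # Best matches first; ties broken by record id.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


idx = SimilarityAwareIndex()
for rid, name in [(1, "smith"), (2, "smyth"), (3, "jones"), (4, "smith")]:
    idx.add(rid, name)
print(idx.query("smith"))   # known value: no similarity computed at query time
print(idx.query("smithe"))  # unseen value: compared within block "s" only
```

In this sketch a query for a value already in the index performs only dictionary lookups, which is the intuition behind the fast query times reported above. A query for an unseen value falls back to comparing against the values in its block.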
Last modified: Tue May 31 12:56:01 EST 2011