ANU Computer Science Technical Reports

TR-CS-09-01


Peter Christen, Ross Gayler, and David Hawking.
Similarity-Aware Indexing for Real-Time Entity Resolution.
August 2009.



Abstract: Entity resolution, also known as data matching or record linkage, is the task of identifying records from several databases that refer to the same entities. Traditionally, entity resolution has been applied to static databases, for example to find records that relate to the same patient in different health databases. Most research in entity resolution has concentrated on improving the matching quality, making entity resolution scalable to very large databases, or reducing the manual effort required throughout the resolution process. Increasingly, however, organisations face the challenge of having large databases that contain entities, together with a stream of query records that have to be matched against these databases in real time, such that the best matching records are retrieved. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, and online retail stores.

In this paper, a novel inverted-index-based approach for real-time entity resolution is presented. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The presented approach differs from other recently developed approaches to approximate querying in that it allows any similarity comparison function and any 'blocking' function, both possibly domain specific, to be incorporated.
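
The build-time and query-time split described above can be illustrated with a minimal sketch. This is not the report's implementation: it handles a single attribute only, and the class name SimilarityAwareIndex, the stand-in similarity function (Python's difflib ratio), and the stand-in blocking function (first character of the value) are all assumptions chosen for illustration. Both functions are passed in as parameters, reflecting the idea that any similarity or blocking function could be plugged in.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity (assumption): any function returning a value in [0, 1] would do."""
    return SequenceMatcher(None, a, b).ratio()


def blocking_key(value):
    """Stand-in blocking function (assumption): the first character of the value."""
    return value[:1].lower()


class SimilarityAwareIndex:
    """Hypothetical single-attribute index that stores similarities at build time."""

    def __init__(self, sim=similarity, block=blocking_key):
        self.sim = sim
        self.block = block
        self.records = defaultdict(set)  # value -> ids of records holding that value
        self.blocks = defaultdict(set)   # blocking key -> distinct values in that block
        self.sims = defaultdict(dict)    # value -> {other value in same block: similarity}

    def add(self, record_id, value):
        """Build time: index a value and precompute similarities within its block."""
        if value not in self.records:
            key = self.block(value)
            for other in self.blocks[key]:
                s = self.sim(value, other)
                self.sims[value][other] = s
                self.sims[other][value] = s
            self.blocks[key].add(value)
        self.records[value].add(record_id)

    def query(self, value, top=3):
        """Query time: return up to `top` (record id, score) pairs, best first."""
        if value in self.records:
            # Known value: similarities were stored at build time, so only look them up.
            candidates = dict(self.sims.get(value, {}))
            candidates[value] = 1.0
        else:
            # Unseen value: compare only against values in the same block.
            block_values = self.blocks.get(self.block(value), ())
            candidates = {v: self.sim(value, v) for v in block_values}
        scores = {}
        for v, s in candidates.items():
            for rid in self.records[v]:
                scores[rid] = max(scores.get(rid, 0.0), s)
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top]


index = SimilarityAwareIndex()
for rid, surname in [(1, "smith"), (2, "smyth"), (3, "jones"), (4, "smith")]:
    index.add(rid, surname)
print(index.query("smith"))
```

In this sketch a query for a value already in the index needs no similarity computations at all, while a query for an unseen value is compared only against the values sharing its block, which is where the query-time saving would come from under these assumptions.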

Experimental results on a large real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach.


Technical Reports <Technical-DOT-Reports-AT-cs-DOT-anu.edu.au>
Last modified: Tue May 31 12:56:01 EST 2011