9. Record Linkage and Deduplication

Record linkage is the task of comparing records and deciding whether each record pair is a match (the two records represent the same entity) or a non-match (they represent different entities). If the record linkage system cannot make this decision, human intervention (clerical review) is used to decide the matching status of the record pair. Assuming that cleaned and standardised records are available, linking records or deduplicating a data set consists of several steps.

  1. One or more (blocking) indexes need to be built with the aim of grouping together records that potentially match, thus reducing the huge number of possible comparisons. While this grouping should reduce the number of comparisons as much as possible, it is important that no potential match is overlooked because of the indexing process.
  2. After the index(es) are built, records within the same index block are compared field by field using field comparison functions, resulting in a weight vector for each record pair compared.
  3. These weight vectors are then given to a classifier (such as the classical Fellegi and Sunter [13] classifier) that decides if a record pair constitutes a match, a non-match or a possible match.
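
The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the general idea and does not use the project, index, comparator or classifier objects described in the following sections. The toy records, the blocking key (first letter of the surname), the agreement and disagreement weights, and the two thresholds are all assumptions made up for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (record id, surname, given name, postcode)
records = [
    (0, "smith", "john", "2600"),
    (1, "smith", "jon", "2600"),
    (2, "smyth", "john", "2602"),
    (3, "miller", "anna", "2600"),
    (4, "miller", "anna", "2600"),
]


def blocking_key(rec):
    # Step 1: an assumed, very simple blocking key (first letter of surname)
    return rec[1][:1]


def build_blocks(recs):
    # Group records that share a blocking key into the same block
    blocks = defaultdict(list)
    for rec in recs:
        blocks[blocking_key(rec)].append(rec)
    return blocks


def compare(rec_a, rec_b, agree=1.0, disagree=-1.0):
    # Step 2: exact comparison of each field, giving one weight per field
    # (the weight values are illustrative, not estimated from data)
    return [agree if a == b else disagree
            for a, b in zip(rec_a[1:], rec_b[1:])]


def classify(weights, lower=0.0, upper=2.0):
    # Step 3: sum the weight vector and apply two assumed thresholds
    total = sum(weights)
    if total > upper:
        return "match"
    if total < lower:
        return "non-match"
    return "possible match"


def deduplicate(recs):
    # Only record pairs within the same block are compared
    results = {}
    for block in build_blocks(recs).values():
        for rec_a, rec_b in combinations(block, 2):
            results[(rec_a[0], rec_b[0])] = classify(compare(rec_a, rec_b))
    return results


results = deduplicate(records)
for pair, status in sorted(results.items()):
    print(pair, status)
```

With five records there are ten possible record pairs, but blocking reduces this to four comparisons. The sketch also shows the risk mentioned in step 1: a blocking key that is too strict (for example, the full surname) would place "smith" and "smyth" in different blocks, so that pair would never be compared.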

The following sections describe in detail how a linkage or deduplication process can be defined using a project object (as shown in the example at the end of Chapter 5), and how its necessary components (such as indexes, field comparison functions and classifiers) can be defined. Indexing is the topic of Section 9.1. All available field comparison functions are described in Section 9.2, and the initialisation of a record comparator is presented in Section 9.3. Section 9.4 includes example code that shows both field comparison functions and record comparator initialisation. The definition of classifiers is then discussed in Section 9.5, and the chapter concludes with Section 9.6, which shows how to define and start a linkage or deduplication process.


