11.1 Record Pair One-To-One Assignment Restrictions

In many linkage projects one is often only interested in the best record pair matches, and one-to-one assignments need to be forced. The simplest way to do this would be to use a greedy algorithm, but this would result in some assignments being not optimal due to the transitive closure problem.

For example, assume a record $A$ is linked to a record $B$ with a match weight of $42.21$, and record $A$ is also linked to a record $C$ with a weight of $39.01$, and record $B$ is linked to record $C$ with a weight of $44.98$. Assuming that the record pair list is sorted, a greedy assignment procedure will assign record $A$ to record $B$ (because the weight for pair $(A,B)$ is larger than for $(A,C)$), but then it can not assign record $C$ to record $A$ because record $A$ has already been assigned (with a lower weight than the $A$ to $C$ weight). Even if the record pair list is sorted according to the match weights the assignment will most likely not be optimal.

This linear sum assignment problem can be solved with special algorithms that find the optimal solution over all possible assignments. In Febrl we have implemented the Auction algorithm (in the lap.py module) as developed by Bertsekas [3] and others.

Our linear assignment procedure implementation takes as input the results calculated from a classifier (i.e. record pairs and their total weights), and optionally a threshold value (in which case all record pairs that have a weight lower than this threshold are not considered in the assignment procedure). A third argument is the process type which can be either deduplication or linkage. The reason for this is that in a deduplication process a record number can only be part of one assignment (i.e. record $X$ in two record pairs $(X,Y)$ and $(Z,X)$ is the same), while for a linkage process a record number $X$ in the first data set does not identify the same record as a record number $X$ in the second data set (i.e. record pair $(X,Y)$ and $(Z,X)$ can both be assigned).

The first step in the assignment procedure (after filtering out all record pairs with weights less than the given threshold) is to find record numbers that are independent, i.e. that do only appear in exactly one record pair, as they are not part of a multi-record assignment problem. Secondly, we extract all sub-sets of connected record pairs that build one assignment problem. Each of these sub-sets can be solved independently, which is then done in the third step using the Auction algorithm [3].

The output of the one-to-one assignment routine is a dictionary that contains the optimal record pair assignments and their corresponding matching weights.