9.2 Field Comparison Functions

The heart of the record linkage process is the comparison of fields from individual records. The field comparison functions return the basic matching weights, which are stored in a weight vector for each compared record pair. The weight vectors are then passed to a classifier, such as the classical Fellegi and Sunter [13] approach, which calculates a matching decision (match, non-match or possible match).
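As an illustration of how a weight vector can lead to a three-way decision, the following minimal sketch sums the weights of one record pair and compares the total with two thresholds, in the spirit of the classical Fellegi and Sunter approach. The function name classify_weight_vector and the threshold values are hypothetical and chosen for illustration only; they are not part of the Febrl API.

```python
def classify_weight_vector(weight_vector, lower_threshold, upper_threshold):
    """Classify one record pair from its vector of matching weights.

    Hypothetical helper, not a Febrl function: the weights are summed and
    the total is compared with two thresholds.
    """
    if lower_threshold > upper_threshold:
        raise ValueError("lower_threshold must not exceed upper_threshold")

    total_weight = sum(weight_vector)

    if total_weight > upper_threshold:
        return "match"
    if total_weight < lower_threshold:
        return "non-match"
    return "possible match"  # Totals between the two thresholds


# Example weight vectors for three record pairs (illustrative values only)
decision_a = classify_weight_vector([4.2, 3.1, -1.0], 0.0, 5.0)
decision_b = classify_weight_vector([-4.2, -3.1, 1.0], 0.0, 5.0)
decision_c = classify_weight_vector([4.2, -3.1, 1.0], 0.0, 5.0)
```

Record pairs whose total weight falls between the two thresholds are left as possible matches, which typically require further review.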

The Febrl system contains a number of different FieldComparator functions, implemented in the module comparison.py. The field comparators allow various comparisons of strings, numbers, dates, ages and times.

The following arguments need to be given to all field comparison functions when they are initialised:

The values for all M- and U-probabilities must be between 0.0 and 1.0. The probabilities of a field comparator can be reset at any time using the method set_probabilities().

The agreement and disagreement weights are computed using the M- and U-probabilities, as described in the record linkage literature (see for example [13,15,23,37]).

\begin{eqnarray*}
agree\_weight = \log_2 \left( \frac{m\_probability}
{u\_probability} \right)
\end{eqnarray*}


\begin{eqnarray*}
disagree\_weight = \log_2 \left( \frac{1.0 - m\_probability}
{1.0 - u\_probability} \right)
\end{eqnarray*}
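The two formulas above can be checked numerically with a short sketch. The function compute_weights below is a hypothetical stand-alone helper, not part of comparison.py. It additionally assumes that both probabilities lie strictly between 0.0 and 1.0, because the logarithms are undefined at the end points.

```python
import math


def compute_weights(m_probability, u_probability):
    """Return (agree_weight, disagree_weight) for one field.

    Hypothetical helper that follows the two formulas above. Assumption:
    both probabilities are strictly inside (0.0, 1.0), so that neither a
    division by zero nor a logarithm of zero can occur.
    """
    for name, value in (("m_probability", m_probability),
                        ("u_probability", u_probability)):
        if not 0.0 < value < 1.0:
            raise ValueError("%s must be strictly between 0.0 and 1.0" % name)

    agree_weight = math.log2(m_probability / u_probability)
    disagree_weight = math.log2((1.0 - m_probability) /
                                (1.0 - u_probability))
    return agree_weight, disagree_weight


# Illustrative values: agreement is likely among true matches (M = 0.95)
# and unlikely among non-matches (U = 0.05)
agree_weight, disagree_weight = compute_weights(0.95, 0.05)
```

With an M-probability larger than the U-probability, as in this example, the agreement weight is positive and the disagreement weight is negative.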


Frequency look-up tables can be used with several of the field comparison functions, using the following optional arguments.

The calculation of frequency dependent weights is described in detail in Section 9.2.1 below.

Two additional arguments accepted by all field comparison functions are name and description, which can be used to document the functionality of a field comparator.


