9.2.5 Encoded String Comparison  'FieldComparatorEncodeString'

Phonetic name encoding is traditionally used to create blocking variables in the record linkage process, but it can also be used to compare strings. Several algorithms for phonetic encoding are implemented in the encode.py module.

The encoded string comparison function encodes the two fields (or field lists) given to it and compares the resulting codes. It returns the agreement weight if both values are encoded to the same code, and the disagreement weight otherwise.
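The comparison logic described above can be sketched as follows. This is a minimal illustration, not the Febrl implementation: the function names simple_soundex and encoded_string_compare, their signatures, and the weight values are assumptions made for this example, and the encoder is a basic textbook Soundex rather than the code in encode.py. Missing values and frequency tables are not handled.

```python
# Illustrative sketch only: NOT Febrl's encode.py or comparison module.
# All names and signatures below are hypothetical.

# Map each consonant to its Soundex digit.
_SOUNDEX_DIGITS = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        _SOUNDEX_DIGITS[letter] = digit


def simple_soundex(value, max_code_length=4):
    """Return a basic Soundex code, zero-padded to max_code_length."""
    letters = [c for c in value.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _SOUNDEX_DIGITS.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_DIGITS.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # 'h' and 'w' do not separate equal digits
            prev = digit
    return (code + "0" * max_code_length)[:max_code_length]


def encoded_string_compare(value_a, value_b, agree_weight, disagree_weight,
                           encode=simple_soundex):
    """Return agree_weight if both values get the same code."""
    if encode(value_a) == encode(value_b):
        return agree_weight
    return disagree_weight


print(simple_soundex("Robert"), simple_soundex("Rupert"))  # R163 R163
print(encoded_string_compare("Robert", "Rupert", 5.0, -3.0))  # 5.0
print(encoded_string_compare("Robert", "Peter", 5.0, -3.0))   # -3.0
```

The comparison is all-or-nothing: two values either share a code or they do not, so the result depends entirely on how coarse the chosen encoding is.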

The encoding method has to be selected with the encode_method argument. The following methods are currently implemented in Febrl:

All of these encodings are particularly sensitive to errors in the first letter of a string. The encoded string comparator therefore takes an additional argument, reverse, which can be set to either False or True. If it is set to True, the strings are reversed before they are encoded. The default value for reverse is False.

The maximum length of the calculated codes can be set with the argument max_code_length, which has a default value of 4.
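The effect of the reverse and max_code_length arguments can be illustrated with a small sketch. As an assumption for this example, the encoder is a basic textbook Soundex that pads short codes with zeros; the function name encode_value is hypothetical, and the routines in encode.py may differ in detail (for instance in how they pad or truncate codes).

```python
# Illustrative sketch only: NOT Febrl's encode.py. The function name,
# the zero padding and the Soundex variant are assumptions.

_DIGITS = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        _DIGITS[letter] = digit


def encode_value(value, reverse=False, max_code_length=4):
    """Soundex-style code, optionally computed on the reversed string."""
    if reverse:
        value = value[::-1]  # reverse first, then encode
    letters = [c for c in value.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _DIGITS.get(letters[0], "")
    for c in letters[1:]:
        digit = _DIGITS.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # 'h' and 'w' do not separate equal digits
            prev = digit
    # Pad with zeros, then cut to the requested maximum length.
    return (code + "0" * max_code_length)[:max_code_length]


# A first-letter variation changes the forward code ...
print(encode_value("Christina"), encode_value("Kristina"))  # C623 K623
# ... but not the code of the reversed strings.
print(encode_value("Christina", reverse=True),
      encode_value("Kristina", reverse=True))               # A532 A532

# Longer codes keep more of the string and so separate more values.
print(encode_value("Robertson"), encode_value("Roberts"))   # R163 R163
print(encode_value("Robertson", max_code_length=6),
      encode_value("Roberts", max_code_length=6))           # R16325 R16320
```

Reversing moves the sensitivity from the first letter to the last one, so it helps when variations are expected at the start of a value rather than at its end.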

If a frequency table is given for this field comparator, the agreement weight will be calculated using the frequency of the values in the input fields that are compared, as described in Section 9.2.1. Frequency-dependent weights will also be used to calculate a partial agreement weight.