9.5.2 Flexible Classifier

The flexible classifier allows different methods to be used to calculate the final matching weight for a weight vector (as calculated by a RecordComparator as discussed in Section 9.3). Similar to the Fellegi and Sunter classifier, two thresholds are used to classify a record pair into one of three classes: links, non-links or possible links. The results of a classification are stored in a data structure, which can then be used to produce various output forms as presented in Chapter 11.
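The two-threshold decision can be sketched as follows. This is a minimal illustration, not the classifier's actual code: the function name classify_weight is hypothetical, and the handling of weights that are exactly equal to a threshold is an assumption.

```python
def classify_weight(final_weight, lower_threshold, upper_threshold):
    """Assign a final matching weight to one of the three classes.

    Assumption: a weight exactly equal to a threshold is treated as a
    possible link; the real classifier may handle such ties differently.
    """
    if final_weight > upper_threshold:
        return 'link'
    if final_weight < lower_threshold:
        return 'non-link'
    return 'possible link'
```

With a lower threshold of 10.0 and an upper threshold of 50.0, a final weight of 30.0 would fall between the two thresholds and be classified as a possible link.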

Instead of simply summing all weights in a weight vector, the flexible classifier lets the user define how the final weight is calculated. The user specifies a list of tuples, each containing a function and the elements of the weight vector to which that function is applied, which gives one intermediate weight per tuple. The final weight is then calculated by applying another function, also chosen by the user, to these intermediate weights.

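The following functions can currently be used within the flexible classifier: 'min' (the minimum), 'max' (the maximum), 'add' (the sum), 'mult' (the product) and 'avrg' (the average) of the selected weight vector elements.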

Weight vector elements are selected by giving the desired indexes (starting from 0) in a Python list, e.g. [0,1,4] selects the first two and the fifth field comparison weights. When initialising a flexible classifier, the argument calculate needs to be set to a list made of tuples with functions and weight vector elements as shown in the example below.

The final weight can then be calculated by again using one of the functions 'min', 'max', 'add', 'mult', and 'avrg' given above. The argument final_funct has to be used for this when a flexible classifier is initialised.
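The two-stage calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the classifier's own implementation: the names FUNCTIONS and flexible_final_weight are hypothetical, and 'mult' is assumed to mean the product of the selected values.

```python
from functools import reduce
import operator

# Hypothetical mapping from function names to Python callables.
# Assumption: 'mult' multiplies the selected values together.
FUNCTIONS = {
    'min': min,
    'max': max,
    'add': sum,
    'mult': lambda values: reduce(operator.mul, values, 1.0),
    'avrg': lambda values: sum(values) / len(values),
}

def flexible_final_weight(weight_vector, calculate, final_funct):
    """Return the final weight for one weight vector.

    calculate is a list of (function name, index list) tuples and
    final_funct is the name of the function applied to the resulting
    intermediate weights.
    """
    intermediate = []
    for funct_name, indexes in calculate:
        # Select the requested weight vector elements (indexes start at 0).
        selected = [weight_vector[i] for i in indexes]
        intermediate.append(FUNCTIONS[funct_name](selected))
    return FUNCTIONS[final_funct](intermediate)

# Intermediate weights are 2.0 + 4.0 = 6.0 and max(4.0, 5.0) = 5.0,
# so the average is 5.5.
example = flexible_final_weight([2.0, 4.0, 5.0],
                                [('add', [0, 1]), ('max', [1, 2])],
                                'avrg')
```

Keeping the functions in a dictionary keyed by name mirrors the way they are given as strings in the calculate and final_funct arguments.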

Consider the following example. Assume we have weight vectors that contain weights calculated by eight different field comparison functions (as explained in Section 9.2). We would like to calculate the final weight as the average of 1) the sum of the first four weights, 2) the maximum of weights five and six, and 3) the minimum of weights seven and eight. The corresponding flexible classifier can then be initialised as shown in the following example code.

# ====================================================================

flex_classifier = FlexibleClassifier(name = 'My flexible classifier',
                                dataset_a = mydata_1,
                                dataset_b = mydata_2,
                          lower_threshold = 10.0,
                          upper_threshold = 50.0,
                                calculate = [('add', [0,1,2,3]),
                                             ('max', [4,5]),
                                             ('min', [6,7])],
                              final_funct = 'avrg')
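
To make the arithmetic behind this definition concrete, the calculation can be followed by hand for a single weight vector. The weight values below are invented purely for illustration, and the variable names are hypothetical; only plain Python built-ins are used.

```python
# Invented example weight vector with eight field comparison weights.
weights = [3.0, 1.5, 4.0, 2.5, 6.0, 9.0, 5.0, 7.0]

# ('add', [0,1,2,3]): sum of the first four weights.
sum_first_four = sum(weights[0:4])

# ('max', [4,5]): maximum of weights five and six.
max_five_six = max(weights[4], weights[5])

# ('min', [6,7]): minimum of weights seven and eight.
min_seven_eight = min(weights[6], weights[7])

# final_funct = 'avrg': average of the three intermediate weights.
intermediate = [sum_first_four, max_five_six, min_seven_eight]
final_weight = sum(intermediate) / len(intermediate)
```

Here the intermediate weights are 11.0, 9.0 and 5.0, giving a final weight of 25/3 (about 8.33). Since this is below the lower threshold of 10.0, such a record pair would be expected to be classified as a non-link.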

Note that a weight can be used in more than one of the calculated intermediate weights. It is also possible to leave a weight unused. It is important, however, that weight vectors have at least as many elements as are referenced in the calculate definitions. Because indexes start from 0, every index used must be smaller than the length of the weight vectors.
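A check along these lines could be run before classification. The helper below is a hypothetical sketch, not part of the classifier: it assumes the weight vector length is known in advance, raises an error for any index outside the vector, and reports which weights are left unused.

```python
def check_calculate(calculate, vector_length):
    """Validate the indexes in a calculate definition.

    Raises ValueError if an index lies outside a weight vector of the
    given length, and returns the sorted list of unused indexes.
    """
    used = set()
    for funct_name, indexes in calculate:
        for i in indexes:
            if not 0 <= i < vector_length:
                raise ValueError('Index %d in %r is outside a weight vector '
                                 'of length %d' % (i, funct_name, vector_length))
            used.add(i)
    return sorted(set(range(vector_length)) - used)

# Weight 1 is used twice, weights 2 and 4 are not used at all.
unused = check_calculate([('add', [0, 1]), ('max', [1, 3])], 5)
```

In this example unused is [2, 4], which is permitted; an index of 5 or more would raise an error because the weight vectors only have five elements.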

When a flexible classifier is initialised, the following arguments need to be given.