10.2.3 Fuzzy Matching Engine

Febrl's geocode matching engine is based on the G-NAF inverted index data, and takes a rule-based approach to find an exact match or alternatively one or more approximate matches. It's input is one or many cleaned and standardised user record(s).

The matching engine tries to find an exact match first, but if none can be found it extends its search to neighbouring postcode and suburb regions. First direct neighbouring regions (level 1) are searched, then direct and indirect neighbouring regions (level 2), until either an exact match or a set of approximate matches can been found. In the latter case, either an average location over all the found matches is returned, or a ranked (according to a matching weight) list of possible matches. The following steps explain in more detail (but still on a high conceptual level) how the matching engine works.

  1. Find the set of address level matches (using street number and suffix) and the set of street level matches (using street name and type). Street names are first searched using exact string comparison, and if no street name match can be found approximate string matching will be applied, allowing to find matches with typographical errors.
  2. Set the neighbour search level neighbour_level to 0 (no neighbouring regions are searched).
  3. Find the locality match set (using locality name, qualifier and postcode) according to the current value of neighbour_level. Locality names are first searched using exact string comparison, and if no locality name match can be found approximate string matching will be applied, allowing to find matches with typographical errors. Postcodes are only used in the matching process if they result in the same matches as the locality names (i.e. if they contribute to these matches). If postcodes result in different matches, they are not used in the final match (this is because postcodes are not fixed locality regions, but can be changed by Australia Post at any time).
  4. Check for matches between locality and street level, and if they exist reduce the street level matches using the locality matches, and set a flag street_locality_match to True (otherwise set it to False).
  5. Check for matches between locality and address level, and if they exist reduce the address level matches using the locality matches, and set a flag address_locality_match to True (otherwise set it to False).
  6. Check for matches between street and address level, and if they exist reduce the address level matches using the street matches, and set a flag address_street_match to True (otherwise set it to False).
  7. Now a decision needs to be made (if a match has been found or if the neighbour_level has to be increased), according to the values of the matching flags. A match has been found if all three match flags are True, or if there was no address level match and the street_locality_match flag is True, or if there was no street level match and the address_locality_match flag is True, or if there was no locality level match and the address_street_match flag is True. If no match has been found then increase the neighbour_level (up to a maximum of 2) and jump back to step 3.
  8. If matching records have been found, try to refine the match set using unit, flat and building (or property) information (if such information is available in the user input record).
  9. Combine the matches from the different levels (basically get their intersections), and set the final match_level (to either address, street or locality).
  10. Retrieve the coordinates of the matches from the corresponding G-NAF geocode index (address site, street/locality or locality). If one match has been found retrieve it's coordinates and return them together with the G-NAF PID (persistent identifier). If more than one match were found check if they are within a small area (as defined by the user). If so, average the coordinates and return the averaged location together with the list of G-NAF PIDs. If the distances between the matches were large, return a many match (no coordinates, a list of G-NAF PIDs only).
  11. In some cases a match at a certain level was found, but the corresponding entry is not in the G-NAF geocode index or it doesn't contain coordinates. In this case go to the next higher level (e.g. from address to street, or from street to locality) and try to retrieve their coordinates (back to step 10).
  12. If no match was found return a 'no_match' match status.
Geocoding of multiple addresses is an iterative process where each record is first cleaned and standardised, then geocoded and written into an output data set with coordinates and a match status added to each record.