6.1.2 Step 2: Tagging

After an input component string has been cleaned, the next step is to split it at space boundaries into a list of words, numbers and possible separators. For example, the name input 'doctor peter paul miller' is split into a list containing the four words ['doctor', 'peter', 'paul', 'miller']. All leading and trailing spaces are removed from the list elements.
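
The splitting step can be sketched in a few lines of Python. This is only a minimal illustration, not the actual implementation: the function name split_component is hypothetical, and the sketch assumes that the cleaning step has already placed spaces around any separators.

```python
def split_component(cleaned):
    """Split a cleaned component string at whitespace into a list of elements."""
    # str.split() without arguments splits on runs of whitespace and ignores
    # leading and trailing whitespace, so no element carries surrounding spaces.
    return cleaned.split()


print(split_component('doctor peter paul miller'))
# prints ['doctor', 'peter', 'paul', 'miller']
```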

Using various look-up tables and some hard-coded rules, each element of this list is then assigned one or more tags. The list of possible tags can be found in Appendix B. The hard-coded rules cover, for example, tagging an element as a hyphen, a comma, a slash, a number or an alphanumeric word. Most of the other tags (titles, given names, surnames, postcodes, locality names, wayfare and unit types, countries, etc.) are assigned to words that are listed in one of the look-up tables provided. If a word (or a word sequence) is found in a look-up table, it is not only tagged but also replaced by its corresponding corrected entry in the look-up table.

A word may be listed in more than one look-up table, in which case it is assigned more than one tag (see for example the name word 'peter' below). Words that are not found in any look-up table and that do not match any of the hard-coded tagging rules are assigned the 'UN' (unknown) tag. A title word like 'doctor', for example, is assigned the title tag 'TI' and replaced with the word 'dr', as are the words 'md' and 'phd' (using the example look-up table shown in Table 6.2).
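
The single-word tagging described above might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the table contents, the function name tag_word, the order in which tables and rules are checked, and in particular the tag codes used for the hard-coded rules ('HY', 'CO', 'SL', 'NU', 'AN'), which are placeholders; the authoritative tag list is the one in Appendix B.

```python
import re

# Hypothetical look-up tables: each maps an original word to its corrected form.
TITLE_TABLE = {'doctor': 'dr', 'md': 'dr', 'phd': 'dr', 'mister': 'mr'}
MALE_GIVEN_NAME_TABLE = {'peter': 'peter', 'paul': 'paul'}
SURNAME_TABLE = {'peter': 'peter', 'miller': 'miller'}

# (tag, table) pairs, checked in this order (the order is an assumption).
LOOKUP_TABLES = [
    ('TI', TITLE_TABLE),
    ('GM', MALE_GIVEN_NAME_TABLE),
    ('SN', SURNAME_TABLE),
]

# Hard-coded rules; the tag codes are placeholders, see Appendix B.
HARD_CODED_RULES = [
    ('HY', re.compile(r'-')),                               # hyphen
    ('CO', re.compile(r',')),                               # comma
    ('SL', re.compile(r'/')),                               # slash
    ('NU', re.compile(r'\d+')),                             # number
    ('AN', re.compile(r'(?=.*\d)(?=.*[a-z])[a-z0-9]+')),    # letters and digits
]


def tag_word(word):
    """Return (replacement, tags) for one word; tags are joined with '/'."""
    tags = []
    replacement = word
    # A word found in several tables collects one tag per table.
    for tag, table in LOOKUP_TABLES:
        if word in table:
            tags.append(tag)
            replacement = table[word]
    # Only words not found in any table are tested against the rules here.
    if not tags:
        for tag, pattern in HARD_CODED_RULES:
            if pattern.fullmatch(word):
                tags.append(tag)
                break
    # Anything still untagged is unknown.
    if not tags:
        tags.append('UN')
    return replacement, '/'.join(tags)


tagged = [tag_word(w) for w in ['doctor', 'peter', 'paul', 'miller']]
print([w for w, _ in tagged])   # prints ['dr', 'peter', 'paul', 'miller']
print([t for _, t in tagged])   # prints ['TI', 'GM/SN', 'GM', 'SN']
```

In this sketch a word that appears in several tables keeps the replacement from the last matching table; how conflicting replacements are actually resolved is not specified here.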

The look-up tables are searched using a greedy matching algorithm, which looks for the longest tuple of elements that matches an entry in the look-up tables. For example, the tuple of words ('macquarie','fields') will be matched with the look-up table entry for the locality 'macquarie fields', rather than with the shorter entry 'macquarie' from the same look-up table. As another example, 'st marys' is tagged as 'LN' (locality name) and replaced with the string 'st_marys', rather than 'st' being tagged as 'WT' (wayfare type) and replaced with 'street', and 'marys' being tagged as 'UN' (assuming this word is not found in any look-up table for address words).
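
A greedy longest-match search of this kind can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the actual implementation: the function name greedy_tag, the representation of the table as a dictionary keyed by word tuples, and the table entries (including the replacement 'macquarie_fields') are all hypothetical.

```python
# Hypothetical look-up table keyed by word tuples; each entry gives the
# replacement string and the tag. The entries are illustrative assumptions.
ADDRESS_TABLE = {
    ('macquarie', 'fields'): ('macquarie_fields', 'LN'),
    ('macquarie',): ('macquarie', 'LN'),
    ('st', 'marys'): ('st_marys', 'LN'),
    ('st',): ('street', 'WT'),
}


def greedy_tag(words, table):
    """Tag a word list, always preferring the longest matching word tuple."""
    max_len = max((len(key) for key in table), default=1)
    out_words, out_tags = [], []
    i = 0
    while i < len(words):
        # Try the longest candidate tuple first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in table:
                replacement, tag = table[key]
                out_words.append(replacement)
                out_tags.append(tag)
                i += n          # skip all words consumed by the match
                break
        else:
            # No tuple starting at position i matched: unknown word.
            out_words.append(words[i])
            out_tags.append('UN')
            i += 1
    return out_words, out_tags


print(greedy_tag(['st', 'marys'], ADDRESS_TABLE))
# prints (['st_marys'], ['LN'])
print(greedy_tag(['st', 'john'], ADDRESS_TABLE))
# prints (['street', 'john'], ['WT', 'UN'])
```

Trying the longest tuple first is what prevents 'st' from being consumed as a wayfare type when the following word completes a known locality name.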


Table 6.2: Example title look-up table (tag TI).
    Original      Replacement
    ------------  -----------
    ...           ...
    '...ses'      'ms'
    'mister'      'mr'


While the input to a tagging routine is a cleaned string, the output is a list of elements and the corresponding list of tags. For the example input name string

    'doctor peter paul miller'

a possible output could be

    Word list: ['dr', 'peter', 'paul', 'miller']
    Tag list:  ['TI', 'GM/SN', 'GM',   'SN'    ]

assuming that 'peter' is listed in both the look-up tables for male given names ('GM' tag) and surnames ('SN' tag).