6.1.1 Step 1: Cleaning

The input to the data cleaning routine is a string that contains one input component, i.e. either a name or an address. First, all letters in the string are converted to lower case. Then a correction list of replacement strings is used to replace certain words, abbreviations and characters with others. For example, given the example correction list in Table 6.1, variations of 'known as', such as 'a.k.a.' or 'aka', are all replaced with the standard string 'known as'. A correction list is loaded from a correction list file (see Section 14.1 for details of the format of such files). Each entry in the list consists of an original string (one or more words, or a single character) and a corresponding replacement string. For each entry in the list, the input string is scanned, and every occurrence of the original string is replaced by the corresponding replacement string.
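The two steps described above (lower-casing, then replacement) can be sketched in a few lines of Python. This is only an illustrative sketch, not the actual routine: the function name `clean_component` and the representation of a correction list as a list of (original, replacement) tuples are assumptions made for this example, and the entries are taken from the 'known as' example in the text.

```python
def clean_component(text, correction_list):
    """Lower-case the input, then apply each correction entry in turn.

    correction_list is assumed (for this sketch) to be a list of
    (original, replacement) string pairs.
    """
    cleaned = text.lower()
    for original, replacement in correction_list:
        # str.replace substitutes every occurrence of the original string
        cleaned = cleaned.replace(original, replacement)
    return cleaned


# Hypothetical in-memory correction list; a real one is loaded from a file.
corrections = [("a.k.a.", "known as"), ("aka", "known as")]

print(clean_component("Peter AKA Pete", corrections))  # peter known as pete
```

Because the input is lower-cased before any replacement takes place, the original strings in this sketch are written in lower case as well.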


Table 6.1: Example correction list.
Original    Replacement
...
'('         '|'
':'         ' '


Each correction list is sorted and processed by decreasing length of the original string, i.e. longer original strings are searched for and replaced first. In the example correction list in Table 6.1, the entry ' knownas ' would be searched for first and, if found, replaced by ' known as '. Note the spaces around some of the entries. They are important, especially for short words such as ' na ' (not available). If the entry were just 'na', each occurrence of 'na' in the input would be replaced by a single space ' '. The name 'bernadette' would thus be converted into 'ber dette'.
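The ordering rule and the effect of the surrounding spaces can be illustrated with a short Python sketch. The helper names `sort_by_length` and `apply_corrections`, and the representation of a correction list as (original, replacement) tuples, are assumptions made for this example only; the entries are the ones discussed in the paragraph above.

```python
def sort_by_length(correction_list):
    """Return the entries ordered by decreasing length of the original string."""
    return sorted(correction_list, key=lambda entry: len(entry[0]), reverse=True)


def apply_corrections(text, correction_list):
    """Apply all entries to text, longest original string first."""
    for original, replacement in sort_by_length(correction_list):
        text = text.replace(original, replacement)
    return text


# Entries with surrounding spaces only match whole words inside the string.
padded = [(" na ", " "), (" knownas ", " known as ")]
# Without the spaces, 'na' also matches inside longer words.
unpadded = [("na", " ")]

print(sort_by_length(padded)[0])                         # (' knownas ', ' known as ')
print(apply_corrections("john knownas jack", padded))    # john known as jack
print(apply_corrections("bernadette", padded))           # bernadette
print(apply_corrections("bernadette", unpadded))         # ber dette
```

The last two calls show why the spaces matter: the padded entry leaves 'bernadette' untouched, while the unpadded entry splits the name.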

The output of the data cleaning routine is a new string in which all occurrences of substrings found in the correction list have been replaced with the corresponding replacement strings. Note that the length of the output string may differ from that of the input string.