6.1.4 Word Spilling

Word spilling occurs when data is entered into fixed-length input fields and typing continues automatically in the next field once the current field is full. For example, given a given name field with a maximum length of 10 characters and a surname field with 20 characters, the name 'maria louisa miller' would be stored as given name 'maria loui' and surname 'sa miller'. Checking for word spilling can be a useful data cleaning step if a data set contains such data.
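To make the effect concrete, the following minimal Python sketch simulates how continuous typing fills a sequence of fixed-length fields. The function name spill_into_fields and its parameters are hypothetical and chosen for illustration only; they are not part of Febrl.

```python
def spill_into_fields(text, field_lengths):
    """Simulate continuous typing into fixed-length fields.

    Hypothetical helper for illustration: each field takes as many
    characters as it can hold, and the rest spills into the next field.
    Characters beyond the last field are dropped.
    """
    fields = []
    pos = 0
    for length in field_lengths:
        fields.append(text[pos:pos + length])
        pos += length
    return fields


# Given name field of 10 characters, surname field of 20 characters
print(spill_into_fields('maria louisa miller', [10, 20]))
```

With the field lengths from the example above, the word 'louisa' is split across the two fields, giving 'maria loui' and 'sa miller'.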

The word spilling check concatenates the last word of one field with the first word of the following field and then checks whether the concatenated word is known, i.e. whether it is listed in one of the available look-up tables. If so, the concatenated word is kept; otherwise (i.e. if the word is not known) a whitespace character is inserted between the two original words.
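The check described above can be sketched in Python as follows. This is an illustrative sketch only and not Febrl's actual implementation: the function name join_fields_with_spill_check is hypothetical, the look-up tables are assumed to be available as a plain Python set of known words, and leading and trailing whitespace in each field is assumed to be insignificant.

```python
def join_fields_with_spill_check(fields, known_words):
    """Join field values into one string, checking for word spilling.

    Hypothetical sketch: at each field boundary the last word of the
    text so far and the first word of the next field are concatenated.
    If the result is in known_words (assumed to stand in for the
    look-up tables), the fields are joined directly; otherwise a
    whitespace character is inserted between them.
    """
    joined = ''
    for field in fields:
        field = field.strip()  # Assumption: surrounding whitespace is not significant
        if not field:
            continue  # Skip empty fields
        if not joined:
            joined = field
            continue
        last_word = joined.split()[-1]
        first_word = field.split()[0]
        if last_word + first_word in known_words:
            joined += field  # Known word: keep the concatenation
        else:
            joined += ' ' + field  # Unknown word: separate with whitespace
    return joined


# Assumed look-up table content, for illustration only
known = {'maria', 'louisa', 'miller', 'peter'}

print(join_fields_with_spill_check(['maria loui', 'sa miller'], known))
print(join_fields_with_spill_check(['peter', 'miller'], known))
```

In the first call 'loui' and 'sa' combine to the known word 'louisa', so the fields are joined without a space. In the second call 'petermiller' is not known, so a whitespace character is inserted between the two words.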

Note: It can be argued, cogently, that in circumstances where the street address is already segmented into components (such as wayfare (street) number, wayfare name, locality (suburb or town) and postal code), it does not make much sense to concatenate these components and then try to parse them back into individual components again. Future versions of Febrl will add support for standardising such already-segmented data. However, in real life, things are often not so clear cut, and data items are often entered in the wrong fields or spill over from one field to the next. In these circumstances, it may be advantageous to re-combine all the address and/or name elements and re-segment them in the data cleaning and standardisation process.