8.1 Program 'tagdata.py'

The program tagdata.py is used to create tagged training records selected from the original data set. Each training record is selected randomly from the input data set, then cleaned and tagged in the same way as in the data cleaning and standardisation process within the standardisation.py module. The resulting tag sequence (or sequences) is written to the training file, together with the original record as a comment.
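
The following sketch illustrates this selection and tagging loop in a simplified form. The two helper functions clean_component() and get_tag_sequences() are placeholders for illustration only; in tagdata.py the actual cleaning and tagging is done by the Febrl standardisation routines and the look-up tables defined further below.

# Simplified sketch of the selection and tagging loop (illustration
# only). The two helper functions below are placeholders; in
# tagdata.py the cleaning and tagging is done by the Febrl
# standardisation routines.

import random

def clean_component(value):  # Placeholder: lower-case and normalise
  return ' '.join(value.lower().split())  # whitespace only

def get_tag_sequences(value):  # Placeholder: tag every word as 'UN:'
  return [['UN:' for word in value.split()]]

def make_training_file(records, num_select, out_file_name):
  selected = random.sample(range(len(records)), num_select)
  out_file = open(out_file_name, 'w')
  for line_num in sorted(selected):
    cleaned = clean_component(records[line_num])
    out_file.write('# %d: |%s|\n' % (line_num, cleaned))
    for tag_seq in get_tag_sequences(cleaned):
      out_file.write('  '+', '.join(tag_seq)+'\n')
    out_file.write('\n')
  out_file.close()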

The program is called from the command line with:

python tagdata.py

All settings are made within the program itself, as shown in the code example at the end of this section. The following paragraphs describe the different configuration settings that have to be defined.

If the option 'hmm_file_name' is defined (i.e. set to the name of a HMM file), the selected training records are given both tags and (hidden) states, as tag:hmm_state pairs, using this HMM. This allows a semi-automatic training process, where the user only has to inspect the output file and change the HMM states for records that are standardised incorrectly. This mechanism reduces the time needed to create enough records to train a HMM.
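
The idea of proposing a state for each tag can be illustrated with a small Viterbi decoding sketch. The states, tags and probabilities below are made-up values for illustration only and are not taken from an actual Febrl HMM file; tagdata.py itself uses the HMM loaded from the file given in 'hmm_file_name'.

# Sketch of how a HMM can propose a state for each tag in a sequence
# (Viterbi decoding). All probabilities below are made-up values for
# illustration only, not taken from an actual Febrl HMM file.

states = ['titl', 'gname1', 'sname1']

init_prob  = {'titl':0.6, 'gname1':0.3, 'sname1':0.1}

trans_prob = {'titl':  {'titl':0.05, 'gname1':0.85, 'sname1':0.10},
              'gname1':{'titl':0.01, 'gname1':0.09, 'sname1':0.90},
              'sname1':{'titl':0.01, 'gname1':0.04, 'sname1':0.95}}

obs_prob   = {'titl':  {'TI':0.90, 'GM':0.05, 'SN':0.05},
              'gname1':{'TI':0.02, 'GM':0.88, 'SN':0.10},
              'sname1':{'TI':0.02, 'GM':0.28, 'SN':0.70}}

def viterbi(tag_seq):
  # delta[s] is the probability of the best state sequence ending in
  # state s, path[s] is that state sequence
  delta = {}
  path  = {}
  for s in states:
    delta[s] = init_prob[s] * obs_prob[s][tag_seq[0]]
    path[s]  = [s]
  for tag in tag_seq[1:]:
    new_delta, new_path = {}, {}
    for s in states:
      prob, prev = max([(delta[p]*trans_prob[p][s], p) for p in states])
      new_delta[s] = prob * obs_prob[s][tag]
      new_path[s]  = path[prev]+[s]
    delta, path = new_delta, new_path
  best_prob, best_state = max([(delta[s], s) for s in states])
  return path[best_state]

tags = ['TI', 'GM', 'SN']
hmm_states = viterbi(tags)
print(', '.join(['%s:%s' % ts for ts in zip(tags, hmm_states)]))
# Prints: TI:titl, GM:gname1, SN:sname1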

The format of the output file is as follows. Each selected original input record (its name or address component) is written to the output file as a comment line, starting with a hash character '#' followed by the record's line number in the input file (counting from zero). After each such comment line, one or more lines with tag sequences follow.

The user has to manually inspect the output file and delete (or comment out) all lines with tags that are not correct, and insert a HMM state name for each observation tag in a sequence (or modify the HMM state given).

For example, if we have the three selected input records (name component)

'dr peter baxter dea'
'miss monica mitchell meyer'
'phd tim william jones harris'

they will be processed (depending on the available look-up tables) and written into the output training file as

# 0: |dr peter baxter dea|
  TI:, GM:, GM:, GF:
  TI:, GM:, SN:, GF:
  TI:, GM:, GM:, SN:
  TI:, GM:, SN:, SN:

# 1: |miss monica mitchell meyer|
  TI:, UN:, GM:, SN:
  TI:, UN:, SN:, SN:

# 2: |phd tim william jones harris|
  TI:, GM:, GM:, UN:, SN:

If a HMM file is defined in option 'hmm_file_name', the output will look something like the following (again depending on the available look-up tables)

# 0: |dr peter baxter dea|
# TI:titl, GM:gname1, GM:gname2, GF:sname1
# TI:titl, GM:gname1, SN:sname1, GF:sname2
  TI:titl, GM:gname1, GM:gname2, SN:sname1
# TI:titl, GM:gname1, SN:sname1, SN:sname2

# 1: |miss monica mitchell meyer|
  TI:titl, UN:gname1, GM:sname1, SN:sname2
# TI:titl, UN:gname1, SN:sname1, SN:sname2

# 2: |phd tim william jones harris|
  TI:titl, GM:gname1, GM:gname2, UN:sname1, SN:sname2
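
Once corrected by the user, such a training file can later be read back to train a HMM; comment lines and empty lines are skipped, and each remaining line is split into its tag:state pairs. A simple parser for this format could look like the following sketch (not the actual Febrl code).

# Sketch of reading back a tagged training file in the format shown
# above (not the actual Febrl code). Lines starting with '#' and
# empty lines are skipped, all other lines are split into tag:state
# pairs. In the un-tagged case (no HMM file given) the state part of
# each pair is empty.

def read_training_file(file_name):
  sequences = []
  for line in open(file_name):
    line = line.strip()
    if (line == '') or (line[0] == '#'):
      continue  # Skip empty lines and commented out sequences
    pairs = []
    for item in line.split(','):
      tag, state = item.strip().split(':')
      pairs.append((tag, state))
    sequences.append(pairs)
  return sequences

# Each returned sequence is a list of tuples, for example:
# [('TI','titl'), ('GM','gname1'), ('GM','gname2'), ('SN','sname1')]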

The following code example shows the part of the tagdata.py program that needs to be modified by the user according to her or his needs.

# ====================================================================
# Define a project logger

init_febrl_logger(log_file_name = 'febrl-tagdata.log',
                     file_level = 'WARN',
                  console_level = 'INFO',
                      clear_log = True,
                parallel_output = 'host')

# ====================================================================
# Set up Febrl and create a new project (or load a saved project)

tag_febrl = Febrl(description = 'Data tagging Febrl instance',
                   febrl_path = '.')

tag_project = tag_febrl.new_project(name = 'Tag-Data',
                             description = 'Data tagging module',
                               file_name = 'tag.fbr')


# ====================================================================
# Define settings for data tagging

# Define your original input data set - - - - - - - - - - - - - - - -
#
input_data = DataSetCSV(name = 'example1in',
                 description = 'Example data set 1',
                 access_mode = 'read',
                header_lines = 1,
                   file_name = 'dsgen'+dirsep+'dataset1.csv',
                      fields = {'rec_id':0,
                                'given_name':1,
                                'surname':2,
                                'street_num':3,
                                'address_part_1':4,
                                'address_part_2':5,
                                'suburb':6,
                                'postcode':7,
                                'state':8,
                                'date_of_birth':9,
                                'soc_sec_id':10},
              fields_default = '',
                strip_fields = True,
              missing_values = ['','missing'])

# Define block of records to be used for tagging - - - - - - - - - - -
#
start_rec_number = 0
end_rec_number =   1000 # input_data.num_records

# Define number of records to be selected randomly - - - - - - - - - -
#
num_rec_to_select = 500

# Define name of output data set - - - - - - - - - - - - - - - - - - -
#
output_file_name = 'tagged-data.csv'

# Component: Can either be 'name' or 'address' - - - - - - - - - - - -
#
tag_component = 'address'

# Define a list with field names from the input data set in the  - - -
# component (name or address - address in this example)
#
tag_component_fields = ['street_num', 'address_part_1',
                        'address_part_2', 'suburb', 'postcode',
                        'state']

# Define if word spilling should be checked or not  - - - - - - - - - 
#
check_word_spilling = True  # Set to True or False

# Define the field separator - - - - - - - - - - - - - - - - - - - - -
#
field_separator = ' '

# Use HMM for tagging and segmenting  - - - - - - - - - - - - - - - -
# (set to the name of a HMM file or None)
#
hmm_file_name = 'hmm'+dirsep+'address-absdiscount.hmm'

# Retag an existing training file  - - - - - - - - - - - - - - - - - -
# - Note that re-tagging is only possible if a HMM file name is given
#   as well
# - If the retag file name is defined, the start and end record
#   numbers as defined above are not used; instead, the record numbers
#   in the retag file are used.
#
retag_file_name = None  # Set to name of an existing training file or
                        # to None

# Write out frequencies into a file  - - - - - - - - - - - - - - - - -
#
freqs_file_name = 'tagged-data-freqs.txt' # Set to a file name or None

# Define and load lookup tables - - - - - - - - - - - - - - - - - - -
#
name_lookup_table = TagLookupTable(name = 'Name lookup table',
                                default = '')
name_lookup_table.load(['data'+dirsep+'givenname_f.tbl',
                        'data'+dirsep+'givenname_m.tbl',
                        'data'+dirsep+'name_prefix.tbl',
                        'data'+dirsep+'name_misc.tbl',
                        'data'+dirsep+'saints.tbl',
                        'data'+dirsep+'surname.tbl',
                        'data'+dirsep+'title.tbl'])

name_correction_list = CorrectionList(name = 'Name correction list')
name_correction_list.load('data'+dirsep+'name_corr.lst')

address_lookup_table = TagLookupTable(name = 'Address lookup table',
                                   default = '')
address_lookup_table.load(['data'+dirsep+'country.tbl',
                           'data'+dirsep+'address_misc.tbl',
                           'data'+dirsep+'address_qual.tbl',
                           'data'+dirsep+'institution_type.tbl',
                           'data'+dirsep+'locality_name_act.tbl',
                           'data'+dirsep+'locality_name_nsw.tbl',
                           'data'+dirsep+'post_address.tbl',
                           'data'+dirsep+'postcode_act.tbl',
                           'data'+dirsep+'postcode_nsw.tbl',
                           'data'+dirsep+'saints.tbl',
                           'data'+dirsep+'territory.tbl',
                           'data'+dirsep+'unit_type.tbl',
                           'data'+dirsep+'wayfare_type.tbl'])

address_correction_list = CorrectionList(name = 'Address corr. list')
address_correction_list.load('data'+dirsep+'address_corr.lst')