10.3 Program 'process-gnaf.py'

To use the G-NAF data set efficiently for geocoding, the necessary G-NAF files need to be cleaned, pre-processed and indexed so that matching records (and their longitude and latitude) can be retrieved quickly.

The program process-gnaf.py (available in the geocode directory) performs this pre-processing of the G-NAF files, which are assumed to be available as CSV (comma separated values) text files. The main computational routines used in process-gnaf.py are implemented in the module gnaffunctions.py, which is also available in the geocode directory.
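
To make the idea of indexing CSV records concrete, the following is a minimal sketch of building an inverted index from a CSV file. It is not taken from gnaffunctions.py: the column names, the sample rows, and the functions clean and build_inverted_index are all hypothetical, and it is written for a current Python 3 interpreter.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real G-NAF files have different columns.
SAMPLE_CSV = """locality_pid,locality_name,postcode
LOC1,Sydney,2000
LOC2,Canberra,2600
LOC3, sydney ,2001
"""


def clean(value):
    """Normalise a raw field: strip surrounding whitespace and lower-case it."""
    return value.strip().lower()


def build_inverted_index(csv_file, key_field, pid_field):
    """Map each cleaned value of key_field to the set of PIDs that contain it."""
    index = defaultdict(set)
    for row in csv.DictReader(csv_file):
        index[clean(row[key_field])].add(row[pid_field])
    return dict(index)


locality_index = build_inverted_index(
    io.StringIO(SAMPLE_CSV), "locality_name", "locality_pid"
)
print(sorted(locality_index["sydney"]))  # ['LOC1', 'LOC3']
```

Because the values are cleaned before they are used as keys, "Sydney" and " sydney " end up in the same index entry, which is what allows a later lookup by value to return all matching record identifiers at once.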

As discussed in Section 10.1, G-NAF consists of many files containing the normalised address, street and locality data, geocoding information (for address sites, streets and localities), as well as various alias information.

All settings for process-gnaf.py need to be specified by the user within the program itself. The main processing flags (or switches) that control what kind of pre-processing is performed are discussed first, followed by all other process-gnaf.py settings.

At the beginning of process-gnaf.py, after the license header, module description and module imports, is a code section containing a number of flags (or switches) that can be set to True or False to enable or disable the processing of parts of the G-NAF files.

The following code section shows these flags as taken from process-gnaf.py.

# ====================================================================
# Some flags that control the G-NAF pre-processing, set to either True
# or False

check_pid_uniqueness = False

save_pickle_files = True    # Save inverted indexes into binary Python
                            # pickles
save_shelve_files = True    # Save inverted indexes into binary Python
                            # shelves
save_text_files   = True    # Save inverted indexes into text files

process_coll_dist_files = True    # Process collection district files

process_locality_files = True     # Process the G-NAF locality related
                                  # files
process_street_files = True       # Process the G-NAF street related
                                  # files
process_address_files = True      # Process the G-NAF address related
                                  # files

create_reverse_lookup_shelve = True  # Create one large shelve to be
                                     # used for reverse look-ups (i.e.
                                     # given one or more PID find the
                                     # corresponding G-NAF records)
create_gnaf_address_csv_file = True  # Create one large CSV file with
                                     # all G-NAF addresses (values
                                     # merged from several files) and
                                     # their locations

gnaf_address_csv_file_name = 'gnav_address_geocodes.csv'
                                             # Corresponding file name
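
As an illustration of how Boolean flags like these can gate individual processing steps, here is a minimal sketch. It only reuses four of the flag names from the listing; the dictionary layout, the functions enabled_steps and run_steps, and the placeholder handlers are assumptions made for this example and do not reflect how process-gnaf.py is actually structured.

```python
# Flag names are copied from the listing; the values are chosen for this
# example only.
flags = {
    "process_coll_dist_files": True,
    "process_locality_files": True,
    "process_street_files": False,
    "process_address_files": False,
}


def enabled_steps(flag_values):
    """Return the names of all steps whose flag is set to True."""
    return [name for name, is_on in flag_values.items() if is_on]


def run_steps(flag_values, handlers):
    """Call the handler of every enabled step and report which ones ran."""
    completed = []
    for name in enabled_steps(flag_values):
        handlers[name]()  # Hypothetical placeholder for the real work
        completed.append(name)
    return completed


# Placeholder handlers that do nothing; a real program would process files.
handlers = {name: (lambda: None) for name in flags}
print(run_steps(flags, handlers))
# ['process_coll_dist_files', 'process_locality_files']
```

Switching a flag to False simply skips the corresponding step, so a run can be restricted to one group of files at a time.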

Several other important settings follow further down in the process-gnaf.py program. They have to be set by the user before a G-NAF pre-processing run can be started.

Once all the settings in process-gnaf.py have been adjusted to the user's needs, the G-NAF pre-processing can be started from the command line with:

python process-gnaf.py

Once all G-NAF files are processed into binary inverted index files (pickle and/or shelve files) they will be available in the defined G-NAF output directory, and can then be used by the Febrl geocoding system as explained in Section 10.4.
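
For readers unfamiliar with the two binary formats, the following sketch shows how a small inverted index could be written to and read back from a pickle file and a shelve file using Python's standard pickle and shelve modules. The index contents and the file names are hypothetical, and the files produced by process-gnaf.py may be organised differently.

```python
import os
import pickle
import shelve
import tempfile

# Hypothetical inverted index: value -> set of record identifiers (PIDs).
inverted_index = {"sydney": {"LOC1", "LOC3"}, "canberra": {"LOC2"}}

with tempfile.TemporaryDirectory() as out_dir:
    # A pickle stores the whole dictionary as one object; loading it reads
    # the complete index into main memory.
    pickle_path = os.path.join(out_dir, "locality_name.pik")
    with open(pickle_path, "wb") as pickle_file:
        pickle.dump(inverted_index, pickle_file)
    with open(pickle_path, "rb") as pickle_file:
        from_pickle = pickle.load(pickle_file)

    # A shelve is a persistent dictionary on disk; single entries can be
    # fetched by key without loading everything.
    shelve_path = os.path.join(out_dir, "locality_name.slv")
    with shelve.open(shelve_path) as db:
        for key, pids in inverted_index.items():
            db[key] = pids
    with shelve.open(shelve_path) as db:
        from_shelve = db["sydney"]

print(from_pickle == inverted_index)  # True
print(sorted(from_shelve))            # ['LOC1', 'LOC3']
```

The trade-off is the usual one: a pickle is fast to query once loaded but needs enough memory for the full index, while a shelve keeps memory use low at the cost of disk access per lookup.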

Warning: While processing G-NAF address related files, the program process-gnaf.py uses a large amount of main memory, for example around 3.5 Gigabytes for processing around 4 million records from New South Wales, if all the pre-processing is done in one pass (that is, if all flags explained above are set to True). If your machine has less main memory than this, swapping (or thrashing) will occur, which increases the pre-processing time tremendously and slows down your machine. See Section 10.3.1 for more details.
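
A rough way to judge whether a single-pass run is feasible on a given machine is to scale the figure quoted in the warning. The sketch below assumes, purely for illustration, that memory use grows linearly with the number of records; this assumption is not stated anywhere in the source, and the function names are hypothetical, so the result should be treated as a ballpark only.

```python
# Reference point taken from the warning above.
REFERENCE_RECORDS = 4_000_000
REFERENCE_MEMORY_GB = 3.5


def estimate_memory_gb(num_records):
    """Estimate memory use, assuming (unverified) linear scaling."""
    return REFERENCE_MEMORY_GB * num_records / REFERENCE_RECORDS


def fits_in_memory(num_records, available_gb):
    """Return True if the estimate does not exceed the available memory."""
    return estimate_memory_gb(num_records) <= available_gb


print(estimate_memory_gb(2_000_000))   # 1.75
print(fits_in_memory(4_000_000, 2.0))  # False
```

If the estimate exceeds the available main memory, disabling some of the processing flags and running the program in several passes is the obvious alternative.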


