13.1 COL Data Set Implementation

Text files with fixed-width columns are in common use. Fields in such files are normally specified either by a start column and a field width, or by a start and an end column. Febrl uses the first format: each field is defined by its start column (counting from zero) and its field width (or length). Such files often have the file extension '.txt'. This data set implementation allows sequential access only.
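
Both ways of specifying a field carry the same information, so a layout given as start and end columns can be converted to the start and width form. The following is a minimal sketch in plain Python and is not part of Febrl. The function name is hypothetical, and the sketch assumes that columns are counted from zero and that the end column is inclusive (conventions differ between file descriptions).

```python
def start_end_to_start_width(start, end):
    """Convert a zero-based, inclusive (start, end) column pair
    into a (start, width) tuple."""
    if end < start:
        raise ValueError('end column %d is before start column %d'
                         % (end, start))
    return (start, end - start + 1)

# A field occupying columns 10 to 13 (inclusive) is four characters wide
print(start_end_to_start_width(10, 13))
```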

The fields attribute of a COL data set must be a dictionary where the keys are the field names and the values are tuples containing the start column (counting from 0) and the field width.
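
To illustrate what such a dictionary describes, the sketch below cuts one line of text into named fields using ordinary string slicing. It is written in plain Python, independent of Febrl. The function name split_fixed_width and the sample line are hypothetical and serve only as an illustration.

```python
def split_fixed_width(line, fields):
    """Return a dictionary mapping each field name to the part of
    'line' described by its (start, width) tuple."""
    record = {}
    for name, (start, width) in fields.items():
        record[name] = line[start:start + width]
    return record

fields = {'year': (0, 4), 'surname': (4, 10), 'givenname': (14, 10)}

# Build a sample line by padding each value to its field width
line = '1995' + 'Smith'.ljust(10) + 'Mary'.ljust(10)

record = split_fixed_width(line, fields)
print(record['year'])
print(record['surname'].strip())  # Padding blanks are still present
```

Note that the slices keep the padding blanks, so values usually need to be stripped before they are compared.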

Additional attributes (besides the general data set attributes as described above) for a COL data set are

Note: When a COL data set is initialised in read access mode, fields may overlap, and gaps between fields are also allowed. For example:
     fields = {'hospitalcode':(10,4),
                       'year':(14,8),
               'yearhospcode':(10,12),
                       'name':(30,20),
                    'address':(60,50)}

However, for COL data sets initialised in write or append mode, the field definitions must neither overlap nor leave gaps between them.
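
A field dictionary can be checked against this rule before it is used for writing. The following is a minimal sketch in plain Python and is not how Febrl itself performs the check. The function name check_contiguous is hypothetical, and the sketch assumes that the first field has to start at column 0, which the text above does not state explicitly.

```python
def check_contiguous(fields):
    """Return a list of problems (overlaps or gaps) found in a
    dictionary of (start, width) field definitions."""
    problems = []
    expected = 0  # Column at which the next field should start
    # Process the fields in order of start column, then width
    ordered = sorted(fields.items(), key=lambda item: item[1])
    for name, (start, width) in ordered:
        if start < expected:
            problems.append('field %s overlaps an earlier field' % name)
        elif start > expected:
            problems.append('gap before field %s (columns %d-%d)'
                            % (name, expected, start - 1))
        expected = max(expected, start + width)
    return problems

read_fields = {'hospitalcode': (10, 4), 'year': (14, 8),
               'yearhospcode': (10, 12), 'name': (30, 20),
               'address': (60, 50)}
write_fields = {'year': (0, 4), 'surname': (4, 10), 'givenname': (14, 10)}

for problem in check_contiguous(read_fields):
    print(problem)
print(check_contiguous(write_fields))  # Empty list: no overlaps, no gaps
```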

Note: In its current implementation, a COL data set can only consist of one underlying COL text file. The handling of multiple files as one data set will be implemented in a future version of Febrl.

The following example shows how to initialise a COL data set and how to access it in read mode. It is assumed that the dataset.py module has been imported using the import dataset command.

# ====================================================================

mydata = dataset.DataSetCOL(name = 'hospital-data',
                     description = 'Hospital data from 1990-2000',
                    access_right = 'read',
                    header_lines = 1,
                       file_name = './data/hospital.txt',
                          fields = {'year':(0,4),
                                    'surname':(4,10),
                                    'givenname':(14,10),
                                    'dob':(24,8),
                                    'address':(32,30),
                                    'postcode':(62,4),
                                    'state':(66,3)},
                  fields_default = '',
                    strip_fields = True,
                  missing_values = ['','missing'])

print mydata.num_records  # Print total number of records

first_record = mydata.read_record()  # Returns one record

hundred_records = mydata.read_records(0,100)  # Read 100 records
ten_records = mydata.read_records(2000,10)  # Read another 10 records

mydata.finalise()  # Close file, finalise access to data set
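
For readers without a Febrl installation, the sketch below mimics sequential reading of a fixed-width file in plain Python. It is not the DataSetCOL implementation. The function iter_records, the helper make_line and the sample data are hypothetical, and the sketch assumes that header_lines is the number of lines to skip at the top of the file and that strip_fields removes leading and trailing blanks from each value.

```python
import io

FIELDS = {'year': (0, 4), 'surname': (4, 10), 'givenname': (14, 10)}

def make_line(year, surname, givenname):
    """Build one fixed-width line matching FIELDS (sample data only)."""
    return year.ljust(4) + surname.ljust(10) + givenname.ljust(10) + '\n'

def iter_records(file_obj, fields, header_lines=0, strip_fields=True):
    """Yield one dictionary per data line, reading the file sequentially."""
    for line_number, line in enumerate(file_obj):
        if line_number < header_lines:
            continue  # Skip header lines at the top of the file
        line = line.rstrip('\n')
        record = {}
        for name, (start, width) in fields.items():
            value = line[start:start + width]
            record[name] = value.strip() if strip_fields else value
        yield record

# An in-memory stand-in for a text file with one header line
sample = io.StringIO(make_line('YEAR', 'SURNAME', 'GIVENNAME') +
                     make_line('1995', 'Smith', 'Mary') +
                     make_line('1998', 'Nguyen', 'Peter'))

records = list(iter_records(sample, FIELDS, header_lines=1))
print(len(records))
print(records[0]['surname'])
```

Because the generator reads the file line by line from the start, records can only be visited in file order, which corresponds to the sequential access described for this data set type.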