12.3 File Analysis Program 'fileanalysis.py'

This program can be used to analyse a data file (currently a text file in comma separated values (CSV) only) in order to get information about its content and quality. Frequency information is collected for all columns, and various statistics are then printed and stored in a statistics results file.

The fileanalysis.py program can be started from the command line with the following argument list

python fileanalysis.py  input_file   steps   num_header_lines   num_columns   stats_file

The needed arguments are

Three more settings can be configured within the fileanalysis.py program itself.

For each column in the input file, the following information and statistics are calculated and printed.

Finally three summary tables will be printed (and stored in the results file if one is given in the arguments list). The first summary table contains the column names (if a header line is given), the number of different values in the columns, the number of records (or lines) in the columns with a missing (or empty) value, the average frequency of the values and their standard deviation, and the type (digits only, numbers only, digit and numbers, or various characters).

The second summary table presents the column names (if a header line is given) and their quantiles as defined in the QUANT_LIST.

The third summary table presents the column names and calculations on the suitability of the columns for blocking. If a column contains less than MISS_PERC_THRES (in percentage) records with missing (or empty) values then an expected number of record pairs is calculated by summing over the frequencies of all values in a column:

\begin{displaymath}
num\_blocks = \sum freq[i] \times (freq[i] - 1)
\end{displaymath}

A small example summary table is shown below. Note that the column 'ident_num' contains values that occur only once (i.e. with maximum frequency 1), therefore the calculated number of record pair comparisons is zero.

Column names  Suitability
----------------------------------------------------------------------------------
        year  11 diff. values, resulting in 84272215518 record pair comparisons
   ident_num  962776 diff. values, resulting in 0 record pair comparisons
     surname  101301 diff. values, resulting in 1608892180 record pair comparisons
              Note column contains 2.23% (21483) records with missing values
  wayfarenum  16.46% (158478) records with missing values  (not suitable)