The Australian National University

Student research opportunities

Flexible household data generation

Project Code: CECS_907

This project is available at the following levels:
CS single semester, Masters

Keywords:

Data matching, data linkage, de-duplication, data cleaning, data quality, open source software, Python, synthetic data, data generator, personal information

Supervisor:

Assoc Professor Peter Christen

Outline:

In 2012 we developed a flexible data generator in collaboration with Fujitsu Laboratories (Japan). The generator creates realistic personal information that can be used to test and evaluate data mining and data matching techniques and systems.

One important extension of this data generator that has not yet been developed is the generation of household data, i.e. records about groups of people (couples, families, housemates, etc.) who live at the same address.

Goals of this project

The aim of this project is to develop, and implement in prototype software, functionality for creating such realistic household data according to various patterns and constraints (such as age, gender, and family roles like parent and child), together with functions that modify groups of records in realistic ways (such as address changes when a family moves, children moving out, or people moving in together).
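To make the goal concrete, the following is a minimal Python sketch of what household generation and modification might look like. It is not the existing generator's API: the names Person, GIVEN_NAMES, generate_family, move_household and child_moves_out, the tiny name lists, and the age constraints (parents aged 25 to 60, each child at least 18 years younger than the younger parent) are all assumptions chosen for illustration only.

```python
import random
from dataclasses import dataclass, replace

# Hypothetical lookup table; a real generator would draw on frequency tables.
GIVEN_NAMES = {"f": ["Anna", "Mia", "Zoe"], "m": ["Liam", "Noah", "Owen"]}


@dataclass(frozen=True)
class Person:
    given_name: str
    surname: str
    gender: str   # "f" or "m"
    age: int
    role: str     # "parent" or "child"
    address: str


def _make_person(rng, surname, address, role, min_age, max_age):
    """Create one record with a random gender, given name and age."""
    gender = rng.choice(["f", "m"])
    return Person(rng.choice(GIVEN_NAMES[gender]), surname, gender,
                  rng.randint(min_age, max_age), role, address)


def generate_family(rng, surname, address, num_children):
    """Create two parents and num_children children at one address."""
    members = [_make_person(rng, surname, address, "parent", 25, 60)
               for _ in range(2)]
    # Assumed constraint: a child is at least 18 years younger than
    # the younger of the two parents.
    max_child_age = min(p.age for p in members) - 18
    members += [_make_person(rng, surname, address, "child", 0, max_child_age)
                for _ in range(num_children)]
    return members


def move_household(members, new_address):
    """Return new records in which the whole household has moved."""
    return [replace(p, address=new_address) for p in members]


def child_moves_out(members, new_address, min_age=18):
    """Move the oldest child aged min_age or more; others stay unchanged."""
    candidates = [i for i, p in enumerate(members)
                  if p.role == "child" and p.age >= min_age]
    if not candidates:
        return list(members)
    idx = max(candidates, key=lambda i: members[i].age)
    updated = list(members)
    updated[idx] = replace(members[idx], address=new_address)
    return updated


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    family = generate_family(rng, "Smith", "1 Main St", 3)
    for person in move_household(family, "9 New Rd"):
        print(person)
```

Because the records are immutable and every modification returns a new list, the original and the modified household can be kept side by side. That pairing is what a data matching evaluation needs as ground truth.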

Requirements/Prerequisites

Knowledge of data mining, entity resolution, and databases, and good programming skills in Python (or a related programming language).

Student Gain

The student working on this project will:
- become familiar with various data quality issues,
- learn about data matching techniques, specifically those with relevance to data
   generation (such as data pre-processing and cleaning, and approximate string
   comparisons),
- become familiar with the Python programming language and its
   object-oriented programming model,
- learn about open source software development,
- learn about the use of the LaTeX typesetting system and about
   technical writing, and
- learn about effectively presenting their work to a general audience.

Links

Co-supervisor Dinusha Vatsalan
Paper "Accurate synthetic generation of realistic personal information"

Updated: 17 June 2013