Student research opportunities
Flexible household data generation
Project Code: CECS_907
This project is available at the following levels:
CS single semester, Masters
Keywords:
Data matching, data linkage, de-duplication, data cleaning, data quality, open source software, Python, synthetic data, data generator, personal information
Supervisor:
Assoc Professor Peter Christen
Outline:
In 2012, we developed a flexible data generator in collaboration with Fujitsu Laboratories (Japan). This generator creates realistic personal information that can be used to test and evaluate data mining and data matching techniques and systems.
One important extension of this data generator that has not yet been developed and implemented is the generation of household data, i.e. records about groups of people (couples, families, housemates, etc.) who live at the same address.
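To make the idea of household data concrete, the following is a minimal sketch of how a family sharing one address might be generated. It is not the existing generator's API: the `PersonRecord` class, the `generate_family` function, the role labels, the name lists, and the age rule (children at least 18 years younger than the youngest parent) are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class PersonRecord:
    rec_id: str
    household_id: str
    given_name: str
    surname: str
    age: int
    gender: str
    role: str      # "parent" or "child" (hypothetical role labels)
    address: str


# Tiny illustrative name lists; a realistic generator would sample from
# frequency tables instead.
GIVEN_NAMES = {"f": ["Anna", "Mei", "Sarah"], "m": ["David", "Kenji", "Tom"]}


def generate_family(household_id, surname, address, num_children, rng):
    """Create two parents and num_children children living at one address."""
    records = []

    def add_person(age, role):
        gender = rng.choice(["f", "m"])
        records.append(PersonRecord(
            rec_id=f"{household_id}-{len(records)}",
            household_id=household_id,
            given_name=rng.choice(GIVEN_NAMES[gender]),
            surname=surname,
            age=age,
            gender=gender,
            role=role,
            address=address,
        ))

    parent_ages = [rng.randint(25, 60) for _ in range(2)]
    for age in parent_ages:
        add_person(age, "parent")

    # Assumed constraint: every child is at least 18 years younger than
    # the youngest parent.
    max_child_age = min(parent_ages) - 18
    for _ in range(num_children):
        add_person(rng.randint(0, max_child_age), "child")

    return records


family = generate_family("h1", "Smith", "1 Main St", 3, random.Random(7))
```

Passing in a seeded `random.Random` instance keeps the output reproducible, which is useful when the generated data serves as a test set for data matching.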
Goals of this project
The aim of this project is to develop, and implement in prototype software, functionality for creating such realistic household data following various patterns and constraints (such as ages, gender, and family roles like parents and children), together with functions that modify groups of records in realistic ways (such as address changes when a family moves, children moving out, or people moving in together).
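The group-level modifications mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: records are plain dictionaries, and the field names (`rec_id`, `household_id`, `role`, `address`), the function names `move_household` and `child_moves_out`, and the `"single"` role label are hypothetical rather than part of the existing generator.

```python
import copy


def move_household(records, new_address):
    """Return modified copies in which the whole household has moved."""
    moved = copy.deepcopy(records)
    for rec in moved:
        rec["address"] = new_address
    return moved


def child_moves_out(records, rec_id, new_address, new_household_id):
    """Return modified copies in which one child forms a new household."""
    updated = copy.deepcopy(records)
    for rec in updated:
        if rec["rec_id"] == rec_id:
            if rec["role"] != "child":
                raise ValueError(f"record {rec_id} is not a child")
            rec["address"] = new_address
            rec["household_id"] = new_household_id
            rec["role"] = "single"  # hypothetical label for a one-person household
            return updated
    raise KeyError(rec_id)


# A small hand-written household used only to exercise the functions.
household = [
    {"rec_id": "h1-0", "household_id": "h1", "role": "parent", "address": "1 Main St"},
    {"rec_id": "h1-1", "household_id": "h1", "role": "parent", "address": "1 Main St"},
    {"rec_id": "h1-2", "household_id": "h1", "role": "child", "address": "1 Main St"},
]

after_move = move_household(household, "9 Hill Rd")
after_leaving = child_moves_out(after_move, "h1-2", "4 Lake Ave", "h2")
```

Returning modified copies rather than changing records in place keeps the original and modified versions side by side, so they can serve as known true matches when evaluating a data matching system.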
Requirements/Prerequisites
Knowledge of data mining, entity resolution, and databases, and good skills in Python programming (or a related programming language).
Student Gain
The student working on this project will:
- become familiar with various data quality issues,
- learn about data matching techniques, specifically those with relevance to data
generation (such as data pre-processing and cleaning, and approximate string
comparisons),
- become familiar with the Python programming language and its
object-oriented programming model,
- learn about open source software development,
- learn about the use of the LaTeX typesetting system and about
technical writing, and
- learn about effectively presenting their work to a general audience.
Links
Co-supervisor Dinusha Vatsalan
Paper "Accurate synthetic generation of realistic personal information"



