Student research opportunities
Research in data matching / record linkage / entity resolution
Project Code: CECS_110
This project is available at the following levels:
Honours, Masters, PhD
Keywords:
Data mining, data matching, data linkage, entity resolution, deduplication, data preprocessing, data cleaning, data integration, privacy preserving record linkage, Honours project, PhD project, MPhil project
Supervisor:
Assoc Professor Peter Christen
Outline:
Many organisations today collect massive amounts of data in their daily businesses. Examples include credit card and insurance companies, the health sector, taxation, social security, statistics, law enforcement and national security, and telecommunications. Data mining techniques are used to analyse such large data sets, in order to find patterns and rules, or to detect outliers.
In many data mining projects, information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such data linkages (also called data matching or entity resolution) is to merge all records that relate to the same entity, such as a customer or a patient.
In most cases the data sources share no common unique identifier, which makes the linkage process non-trivial. A variety of linkage techniques have been developed in recent times. They mainly rely on commonly available attributes, such as names, addresses and dates of birth, to find matching records.
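As a rough illustration of attribute-based matching, the following minimal Python sketch compares two records field by field and averages the similarities. It is not a technique taken from this project: the field names, the use of difflib's SequenceMatcher as the string comparator, the unweighted average and the 0.85 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical attribute names; real data sets will differ.
FIELDS = ("name", "address", "dob")


def similarity(a: str, b: str) -> float:
    """Approximate, case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Unweighted average of the per-attribute similarities of two records."""
    return sum(similarity(rec_a[f], rec_b[f]) for f in FIELDS) / len(FIELDS)


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Classify a record pair as a match if its score reaches the threshold.

    The threshold value is an assumption made for this illustration.
    """
    return match_score(rec_a, rec_b) >= threshold


a = {"name": "Peter Smith", "address": "12 Main Street, Canberra", "dob": "1980-04-17"}
b = {"name": "Pete Smith", "address": "12 Main St, Canberra", "dob": "1980-04-17"}
c = {"name": "Mary Jones", "address": "7 High Road, Sydney", "dob": "1975-11-02"}

print(is_match(a, b))  # True: small spelling variations, same person assumed
print(is_match(a, c))  # False: the attributes differ substantially
```

In practice the attributes would usually be weighted differently and the match decision made by a trained or unsupervised classifier rather than a fixed threshold.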
Goals of this project
Various topics / research projects are possible within this research area. They include:
- developing improved unsupervised classification techniques for automated data matching;
- improving the scalability of data matching techniques to very large databases that contain many millions of records;
- developing parallel data matching techniques (for example parallelising algorithms into a MapReduce framework);
- and developing novel privacy-preserving data matching techniques that are scalable to very large databases.
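To give a flavour of the scalability and MapReduce topics listed above, here is a small single-process Python sketch that mimics the map, shuffle and reduce phases. It assumes a simple blocking strategy (only records sharing a key are compared), and the key used here, the first letter of the surname plus the year of birth, is a hypothetical choice for the example rather than a method prescribed by this project.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(record: dict) -> tuple:
    """Emit (blocking key, record). The key definition is a hypothetical example."""
    key = record["surname"][:1].lower() + record["dob"][:4]
    return key, record


def shuffle(mapped) -> dict:
    """Group records by key, as a MapReduce framework would between phases."""
    groups = defaultdict(list)
    for key, record in mapped:
        groups[key].append(record)
    return groups


def reduce_phase(records: list) -> list:
    """Generate candidate record pairs within one block only."""
    return [(a["id"], b["id"]) for a, b in combinations(records, 2)]


records = [
    {"id": 1, "surname": "Smith", "dob": "1980-04-17"},
    {"id": 2, "surname": "Smyth", "dob": "1980-04-17"},
    {"id": 3, "surname": "Jones", "dob": "1975-11-02"},
    {"id": 4, "surname": "Smith", "dob": "1975-11-02"},
]

groups = shuffle(map(map_phase, records))
candidate_pairs = [pair for recs in groups.values() for pair in reduce_phase(recs)]
print(candidate_pairs)  # [(1, 2)]
```

Only one of the six possible record pairs is compared in this example. In a real MapReduce deployment each block could be processed by a separate reducer, and the choice of blocking key trades off the number of comparisons against the risk of missing true matches.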
Requirements/Prerequisites
Projects in this research area are available either as one-year computer science Honours projects or as multi-year MPhil or PhD projects.
Students interested in undertaking such a project as an MPhil or PhD student should hold the equivalent of an Australian Bachelor's degree with Honours at 2A level or above in computer science. Preferably, they will have done their Honours research in data mining or machine learning, and they should have a very good understanding of algorithms as well as good programming skills.
Further details about requirements for MPhil and PhD students are given in one of the links below.
Student Gain
A student working on a data matching project will learn about various data pre-processing, data cleaning, data matching and data mining techniques, including multi-relational data mining, clustering, classification, and link and graph mining. The student will be able to develop novel techniques that are potentially of high interest to both government agencies and a variety of private sector organisations.
Background Literature
See the link to Peter Christen's publications page below.
Links
Peter Christen's publications
More information for MPhil and PhD students



