The Australian National University

Student research opportunities

Research in data matching / record linkage / entity resolution

Project Code: CECS_110

This project is available at the following levels:
Honours, Masters, PhD

Keywords:

Data mining, data matching, data linkage, entity resolution, deduplication, data preprocessing, data cleaning, data integration, privacy preserving record linkage, Honours project, PhD project, MPhil project

Supervisor:

Assoc Professor Peter Christen

Outline:

Many organisations today collect massive amounts of data in their daily businesses. Examples include credit card and insurance companies, the health sector, taxation, social security, statistics, law enforcement and national security, and telecommunications. Data mining techniques are used to analyse such large data sets, in order to find patterns and rules, or to detect outliers.

In many data mining projects, information from multiple data sources needs to be integrated, combined or linked to allow more detailed analysis. The aim of such data linkage (also called data matching or entity resolution) is to merge all records that relate to the same entity, such as a customer or a patient.

In most cases the data sources share no common unique identifier, which makes the linkage process non-trivial. A variety of linkage techniques have been developed in recent years. They mainly compare commonly available attributes, such as names, addresses and dates of birth, to find matching records.
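As a rough illustration of attribute-based matching, the minimal Python sketch below compares two records field by field and averages the similarities. The field names, the use of `difflib.SequenceMatcher` as the string comparator, the unweighted average and the 0.85 threshold are all assumptions made for this example; they are not taken from the project description, and real linkage systems typically use more specialised comparison functions and classifiers.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two attribute values."""
    # Normalise case and surrounding whitespace before comparing.
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(rec_a: dict, rec_b: dict, fields: list) -> float:
    """Average the per-field similarities over the chosen attributes."""
    scores = [field_similarity(rec_a[f], rec_b[f]) for f in fields]
    return sum(scores) / len(scores)


def is_match(rec_a: dict, rec_b: dict, fields: list, threshold: float = 0.85) -> bool:
    """Classify a record pair as a match if it reaches the (assumed) threshold."""
    return record_similarity(rec_a, rec_b, fields) >= threshold


# Hypothetical records, invented purely for illustration.
FIELDS = ["name", "address", "dob"]
rec_1 = {"name": "Jon Smith", "address": "12 Main St", "dob": "1985-03-02"}
rec_2 = {"name": "John Smith", "address": "12 Main Street", "dob": "1985-03-02"}
rec_3 = {"name": "Mary Jones", "address": "4 Oak Ave", "dob": "1990-07-21"}

print(is_match(rec_1, rec_2, FIELDS))  # spelling variants of the same person
print(is_match(rec_1, rec_3, FIELDS))  # clearly different people
```

A fixed threshold is the simplest possible decision rule; it is used here only to keep the sketch short.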

Goals of this project

Various topics / research projects are possible within this research area. They include:

  • developing improved unsupervised classification techniques for automated data matching;
  • improving the scalability of data matching techniques to very large databases that contain many millions of records;
  • developing parallel data matching techniques (for example, by adapting algorithms to a MapReduce framework);
  • and developing novel privacy-preserving data matching techniques that are scalable to very large databases.
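To give a feel for the scalability topic listed above, the sketch below shows blocking, a standard way to avoid comparing every record with every other record: records are grouped by a cheap key, and only records within the same group become candidate pairs. The key definition (first letter of the surname plus year of birth) and the field names are hypothetical choices for this example, not part of the project description.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical key: first letter of the surname plus the year of birth."""
    return record["surname"][:1].lower() + record["dob"][:4]


def candidate_pairs(records: list) -> list:
    """Return index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.extend(combinations(ids, 2))
    return pairs


# Hypothetical records, invented purely for illustration.
records = [
    {"surname": "Smith", "dob": "1985-03-02"},
    {"surname": "Jones", "dob": "1990-07-21"},
    {"surname": "Smyth", "dob": "1985-03-02"},
    {"surname": "Brown", "dob": "1985-11-30"},
]

print(candidate_pairs(records))
```

With four records a full comparison would need six pairs, while this key yields a single candidate pair. The grouping step is also loosely analogous to the group-by-key phase of a MapReduce job, with each block then compared independently.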

Requirements/Prerequisites

Projects in this research area are available either as one-year computer science Honours projects or as multi-year MPhil or PhD projects.

Students interested in undertaking such a project as an MPhil or PhD student should hold the equivalent of an Australian Bachelors degree with Honours 2A level or above in computer science. Preferably, they will have done their Honours research in data mining or machine learning, and will have a very good understanding of algorithms as well as good programming skills.

Further details about requirements for MPhil and PhD students are given in one of the links below.

Student Gain

A student working on a project in data matching will learn about various data pre-processing, data cleaning, data matching and data mining techniques, including multi-relational data mining, clustering, classification, and link and graph mining. The student will also be able to develop novel techniques that are potentially of high interest to both government agencies and a variety of private sector organisations.

Background Literature

See the link provided to Peter's publication page below.

Links

Peter Christen's publications
More information for MPhil and PhD students

Updated: 5 September 2012