The Australian National University

Student research opportunities

Evaluating and improving classification techniques for data matching

Project Code: CECS_765

This project is available at the following levels:
Honours, Masters

Keywords:

data matching, entity resolution, record linkage, classification, experimentation, evaluation

Supervisor:

Assoc Professor Peter Christen

Outline:

Data matching, also known as entity resolution or record linkage, is the process of identifying which records in two databases refer to the same real-world entity. Many different classification techniques for data matching have been developed in the past decade. However, so far no extensive comparative evaluation of such techniques on different types of data has been conducted.
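As a rough illustration of what classification means here, the sketch below compares every record in one small database with every record in another, builds a vector of per-field similarities for each pair, and labels the pair a match when the average similarity reaches a threshold. It is a minimal sketch only and does not use Febrl's API: the record layout, the field names, the helper functions `similarity`, `compare` and `classify`, and the 0.8 threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical field names, chosen only for this illustration.
FIELDS = ["name", "city"]


def similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare(rec_a: dict, rec_b: dict, fields=FIELDS) -> list:
    """Build a comparison vector: one similarity value per field."""
    return [similarity(rec_a[f], rec_b[f]) for f in fields]


def classify(vector: list, threshold: float = 0.8) -> bool:
    """Label a pair a match if its mean similarity reaches the threshold.

    The threshold value is an assumption, not a recommended setting.
    """
    return sum(vector) / len(vector) >= threshold


# Two tiny made-up databases; "Miler" is a deliberate typo.
db_a = [
    {"id": "a1", "name": "Peter Miller", "city": "Canberra"},
    {"id": "a2", "name": "Mary Jones", "city": "Sydney"},
]
db_b = [
    {"id": "b1", "name": "Peter Miler", "city": "Canberra"},
    {"id": "b2", "name": "Xu Wei", "city": "Perth"},
]

# Compare all pairs (feasible only for very small databases).
matches = [
    (ra["id"], rb["id"])
    for ra in db_a
    for rb in db_b
    if classify(compare(ra, rb))
]
print(matches)
```

A fixed threshold on the mean similarity is about the simplest classifier possible; the techniques this project is concerned with replace the `classify` step with something more sophisticated while keeping the same overall structure.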

Goals of this project

The objectives of this project are (1) to implement a comprehensive set of classification techniques in the Febrl open source data matching system (which has been developed by Peter Christen; see the link below), (2) to conduct experimental evaluations on a variety of data sets, including publicly available test data, synthetic data, and (possibly) large real-world data sets, and (3) to develop improved classification techniques for data matching based on what was learnt from these experiments.
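An experimental evaluation of this kind typically compares the record pairs a classifier labels as matches against a set of known true matches. The sketch below computes precision, recall and F-measure from two such sets. It is an illustrative assumption about how results might be scored, not the evaluation procedure of this project or of Febrl; the function name `evaluate` and the example pairs are made up.

```python
def evaluate(predicted: set, true: set) -> dict:
    """Score predicted match pairs against known true match pairs."""
    tp = len(predicted & true)   # pairs correctly labelled as matches
    fp = len(predicted - true)   # pairs wrongly labelled as matches
    fn = len(true - predicted)   # true matches that were missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0

    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Made-up example: one correct match, one false match, one missed match.
predicted_pairs = {("a1", "b1"), ("a2", "b3")}
true_pairs = {("a1", "b1"), ("a3", "b2")}

scores = evaluate(predicted_pairs, true_pairs)
print(scores)
```

These measures ignore true non-matches, which usually make up the vast majority of candidate record pairs and would make a plain accuracy figure look misleadingly high.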

Febrl already contains several basic classifiers as well as a data generator and a set of small test data sets, which can be used as an initial starting point for this project.

Requirements/Prerequisites

Courses in algorithms and data structures, and in data mining or machine learning, are required. Good programming skills, preferably in Python, C or Java, are desirable.

Student Gain

A student working on this project will learn about different classification techniques for data matching, how to conduct comprehensive experimental evaluations, and how to develop software that will be integrated into an open source tool.

The novel classification techniques that will be developed are potentially of high interest to both government agencies and a variety of private sector organisations, as many of them conduct data matching projects on a routine basis.

Background Literature

For papers about data matching, please consult Peter's publications page, linked below.

Links

Peter's publications
Febrl's SourceForge.net home page

Contact:



Updated: 4 July 2012