Student research opportunities
Evaluating and improving classification techniques for data matching
Project Code: CECS_765
This project is available at the following levels:
Honours, Masters
Keywords:
data matching, entity resolution, record linkage, classification, experimentation, evaluation
Supervisor:
Assoc Professor Peter Christen
Outline:
Data matching, also known as entity resolution or record linkage, is the process of identifying which records in two databases refer to the same real-world entity. Many different classification techniques for data matching have been developed in the past decade. However, so far no extensive comparative evaluation of such techniques on different types of data has been conducted.
Goals of this project
The objectives of this project are (1) to implement a comprehensive set of classification techniques in the Febrl open source data matching system (which has been developed by Peter Christen, see the link below), (2) to conduct experimental evaluations on a variety of data sets, including publicly available test data, synthetic data, and (possibly) large real-world data sets, and (3) to develop improved classification techniques for data matching based on what was learnt from these experiments.
Febrl already contains several basic classifiers as well as a data generator and a set of small test data sets, which can be used as an initial starting point for this project.
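To give a feel for what a basic classifier and its evaluation involve, here is a minimal sketch in Python. It is not Febrl code and does not use the Febrl API: the function names, the example records, the choice of fields and the 0.8 threshold are all assumptions made for illustration. Each candidate record pair is turned into a vector of field similarities, a pair is classified as a match if the average similarity reaches the threshold, and the result is scored with precision and recall against known true matches.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Approximate string similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare(rec_a, rec_b, fields):
    """Build a comparison vector: one similarity value per field."""
    return [similarity(rec_a[f], rec_b[f]) for f in fields]


def classify(vector, threshold=0.8):
    """Threshold classifier: match if the mean similarity is high enough.

    The threshold value is an illustrative assumption, not a recommendation.
    """
    return sum(vector) / len(vector) >= threshold


def precision_recall(predicted, true_matches):
    """Precision and recall of predicted match pairs against true matches."""
    true_positives = len(predicted & true_matches)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_matches) if true_matches else 0.0
    return precision, recall


# Two tiny hypothetical databases; identifiers and values are made up.
db_a = {
    "a1": {"name": "peter christen", "city": "canberra"},
    "a2": {"name": "mary smith", "city": "sydney"},
}
db_b = {
    "b1": {"name": "peter christan", "city": "canberra"},
    "b2": {"name": "john miller", "city": "perth"},
}
true_matches = {("a1", "b1")}
fields = ["name", "city"]

# Compare every record in db_a with every record in db_b (no blocking).
predicted = {
    (id_a, id_b)
    for id_a, rec_a in db_a.items()
    for id_b, rec_b in db_b.items()
    if classify(compare(rec_a, rec_b, fields))
}

precision, recall = precision_recall(predicted, true_matches)
print("predicted matches:", sorted(predicted))
print("precision:", precision, "recall:", recall)
```

Comparing all pairs as done here only works for very small data sets. The classifiers studied in the project would replace the simple averaging rule in classify, while the evaluation step stays the same in spirit.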
Requirements/Prerequisites
Having attended courses in algorithms and data structures, and in data mining or machine learning, is a requirement. Good programming skills, preferably in Python, C, or Java, are desirable.
Student Gain
A student working on this project will learn about different classification techniques for data matching, how to conduct comprehensive experimental evaluations, and how to develop software that will be integrated into an open source tool.
The novel classification techniques that will be developed are potentially of high interest to both government agencies and a variety of private sector organisations, as many of them conduct data matching projects on a routine basis.
Background Literature
For papers about data matching please consult Peter's publication page given below.
Links
Peter's publications
Febrl's Sourceforge.net home page

