I received a Diploma in Computer Science Engineering from ETH Zurich in 1995, and a PhD in Computer Science from the University of Basel in 1999 (both in Switzerland).
My research in data mining and data matching has so far resulted in over 140 publications, including the book Data Matching, published by Springer in 2012. I am also the main developer of 'Febrl' (Freely Extensible Biomedical Record Linkage), an open source data cleaning, deduplication and record linkage system.
My research is in the fields of data mining and data matching (also known as record linkage or entity resolution).
I am especially interested in the development of scalable and real-time algorithms for data matching, and in the privacy and confidentiality aspects of data matching and data mining.
Keywords: Data Matching, Record Linkage, Entity Resolution, Data Mining, Privacy Technologies, Privacy Preserving Record Linkage.
Recent years have seen an explosion in the amount of data being created by businesses, governments, and individuals (think of your photo collections!). Many organisations now have databases measured in Petabytes (millions of Gigabytes). Analysing such large data collections with sophisticated computer programs that extract patterns, rules, or classes is the field of computer science known as Data Mining.
Data Mining is a multi-disciplinary and applied field of computer science that poses many challenges. The size and complexity of today's databases are two of these challenges. Data quality is another: "Real data is dirty" is a common saying among those who work with large databases.
My research has focused on Data Matching, which is the process of identifying and linking records that correspond to the same individuals in different databases when no unique identifiers are available. I have recently written a book on this topic.
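To make the idea concrete, here is a minimal Python sketch of data matching without identifiers. It is an illustration only, not the method used in Febrl or described in the book: the record fields (name, city), the function names, the sample records and the 0.85 threshold are all assumptions made for this example, and the approximate string comparison comes from Python's standard difflib module.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Approximate string similarity between 0.0 and 1.0 (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_records(db_a, db_b, fields, threshold=0.85):
    """Compare every record in db_a with every record in db_b.

    A pair is reported as a match when the average similarity over the
    given fields reaches the (hypothetical) threshold.
    """
    matches = []
    for i, rec_a in enumerate(db_a):
        for j, rec_b in enumerate(db_b):
            score = sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
            if score >= threshold:
                matches.append((i, j, score))
    return matches


# Two small made-up databases with no shared identifier.
db_a = [
    {"name": "Gail Smith", "city": "Sydney"},
    {"name": "Peter Miller", "city": "Canberra"},
]
db_b = [
    {"name": "Gayle Smith", "city": "Sydney"},
    {"name": "Petra Muller", "city": "Melbourne"},
]

linked = match_records(db_a, db_b, ["name", "city"])
for i, j, score in linked:
    print(db_a[i]["name"], "<->", db_b[j]["name"], f"(similarity {score:.2f})")
```

Comparing every record with every other record, as this sketch does, quickly becomes infeasible as databases grow, which is one reason scalable matching algorithms are an active research topic.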
I am also interested in Privacy Aspects related to data mining and data matching, a topic of high interest given that many large organisations (like Google or Facebook) are collecting massive amounts of private information about individuals. The question we investigate is: Can we analyse large data collections without revealing any private or sensitive information?
I teach both undergraduate and Master's level courses, including Programming for Scientists (COMP1730), where we teach students how to program in Python (a language used, for example, by Google and NASA), and Algorithms and Techniques for Data Mining (COMP8400).