The Australian National University

Honours topics

Half of your time as an honours student is spent working on a project. But first you have to find a project topic.

The "official" reference for projects proposed by potential supervisors is the CECS projects database.

Projects are available there for all levels of research, including short projects, summer scholarship projects, Honours projects, and Masters and PhD projects. All potential research students at any level are urged to browse it.

If you see a project that looks interesting, email the potential supervisor about it. Don't be afraid to discuss possible variations to the listed project: what appears on the web site is generally more of a suggestion than a rigid specification.

You don't have to be constrained by what you see on the project website. If you have something that you would like to work on as a project, feel free to discuss it with the honours convener to see whether it could form the basis of an honours project and to identify a possible supervisor. Look at the web pages of Computer Science staff to find out their research interests. Remember that projects may also be supervised by people outside the College, or even outside the University: from CSIRO or NICTA, for instance.

Former Project Topics

For your interest, here is an archive of Honours project proposals from previous years. Some of them are quite ancient now, of course, but they may help to give you ideas of the kind of thing which is suitable.
  • 2009 Proposed Project Topics
  • 2008 Proposed Project Topics
  • 2007 Proposed Project Topics
  • 2006 Proposed Project Topics
  • 2005 Proposed Project Topics
  • 2004 Proposed Project Topics
  • 2003 Proposed Project Topics
  • 2002 Proposed Project Topics
  • 2001 Proposed Project Topics


    2007 Honours project proposals




    Honours projects in data linkage

    Background

    Many organisations today collect massive amounts of data in their daily businesses. Examples include credit card and insurance companies, the health sector (e.g. Medicare), taxation, statistics, police/intelligence and telecommunications. Data mining techniques are used to analyse such large data sets, in order to find patterns and rules, or to detect outliers.

    In many data mining projects information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such data linkages is to merge all records relating to the same entity, such as a customer or patient. Most of the time the linkage process is challenged by the lack of a common unique identifier, and thus becomes non-trivial. Probabilistic data linkage techniques have to be applied using personal identifiers like names, addresses, dates of birth, etc.

    The ANU Data Mining Group is currently working in collaboration with the NSW Department of Health, Centre for Epidemiology and Research on the development of improved techniques and algorithms for large scale high-performance data linkage. We have been developing an open source data linkage system called Febrl (Freely extensible biomedical record linkage), which includes modules for data cleaning and standardisation (of names, addresses, dates of birth, etc.), deduplication, data linkage (or matching), and geocoding.

    For more information, please have a look at our publications and presentations, which describe our research in more detail.

    The following student projects address important issues within our data linkage project.

    Supervisor for all projects: Peter Christen


    Large scale parallelisation of data linkage techniques

    One of the problems when linking large collections of data with tens or even hundreds of millions of records is the complexity of data linkage techniques. Potentially, each record in one data set has to be compared with every record in the second data set, resulting in quadratic complexity, O(n²). Techniques known as blocking are normally used to reduce the number of record pair comparisons (for example, only records that have the same postcode value are compared with each other). However, linking large data sets still takes a lot of time.
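    The blocking idea described above can be sketched in a few lines of Python. The record layout and field names below are invented for illustration; a real system such as Febrl uses far more sophisticated blocking keys than a single exact postcode.

```python
# A minimal sketch of blocking: group records by a blocking key and only
# compare record pairs within the same block, instead of the full cross
# product. Toy records with an assumed 'postcode' field.
from collections import defaultdict
from itertools import product

def blocked_pairs(data_a, data_b, key="postcode"):
    """Yield candidate record pairs that share the same blocking key value."""
    blocks = defaultdict(lambda: ([], []))
    for rec in data_a:
        blocks[rec[key]][0].append(rec)
    for rec in data_b:
        blocks[rec[key]][1].append(rec)
    for left, right in blocks.values():
        for pair in product(left, right):
            yield pair

a = [{"name": "Peter Smith", "postcode": "2602"},
     {"name": "Mary Jones",  "postcode": "2600"}]
b = [{"name": "Pete Smith",  "postcode": "2602"},
     {"name": "M. Jones",    "postcode": "2911"}]

pairs = list(blocked_pairs(a, b))
# Only 1 candidate pair is compared instead of the 4 in the cross product.
```

    Of course, blocking trades completeness for speed: true matches whose blocking values disagree (a mistyped postcode, say) are never compared, which is why real systems use several complementary blocking passes.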

    Student research project

    The aim of this project is to develop parallel and distributed data linkage techniques that allow the linking of large data sets on clusters of workstations or personal computers (which are available in many organisations). Issues which have to be considered include:

    • Data distribution and load balancing
    • Scalability
    • Fault tolerance
    • Privacy and confidentiality (as names and addresses are normally used for the linkage)
    A student working on this project would be involved in:

    • Participation in an exciting research project.
    • Exploration and development of parallel and distributed computing techniques, as well as statistical, machine learning, and data mining techniques.
    • Object-oriented cross-platform programming (using Python and friends).
    • Open source software development (see http://sourceforge.net/projects/febrl).

    For more information, publications, and further links please visit the project web page at http://datamining.anu.edu.au/linkage.html.


    Improved record pair classification techniques

    A crucial step in the data linkage process is the classification of pairs of records that are being compared into either matches or non-matches. For example, consider the following two records containing a name and address each:

    Dr Smith, Peter 42 Miller Street 2602 O'Connor
    Pete Smith 42 Miller St 2600 Canberra A.C.T.

    Do these two records represent the same person, or two different people?
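    As a rough illustration of how such a pair might be scored, the sketch below compares fields with Python's standard difflib. Febrl implements many dedicated string comparators that handle names and addresses far better than this; the field names and the decision threshold here are arbitrary.

```python
# A toy field-wise record pair comparison using a generic string
# similarity (difflib ratio in [0, 1]); purely illustrative.
from difflib import SequenceMatcher

def field_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

rec1 = {"name": "Dr Smith, Peter", "street": "42 Miller Street"}
rec2 = {"name": "Pete Smith",      "street": "42 Miller St"}

scores = {f: field_similarity(rec1[f], rec2[f]) for f in rec1}
average = sum(scores.values()) / len(scores)
# A naive threshold classifier: declare a match if the average
# similarity exceeds some cut-off (0.7 here is an arbitrary choice).
is_match = average > 0.7
```

    Real classifiers, as discussed below, learn how to weight and combine such per-field similarities rather than taking a plain average.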

    In the past few years researchers from the machine learning and AI communities have explored various classification techniques, including decision trees, genetic algorithms, clustering, active learning, support vector machines, and instance-based classifiers, to improve the classification of record pairs. Other researchers have applied techniques from information retrieval and database research to improve the linkage classification quality. However, this research has been fragmented, and so far no large study comparing the many classifiers has been conducted using different data sets (both synthetic and from the real world).

    Student research project

    The aim of this project is to (1) implement prototypes of different classifiers and conduct studies comparing their performance on different data sets, and (2) develop new classifiers, based on existing methods, that are well suited to the data linkage classification problem. Issues involved are:

    • Which classifier performs best for which types of errors in the data (e.g. simple typographical errors, missing words, swapped words, etc.)?
    • Does a combination of classifiers work better than one single classifier?
    • Comparison of supervised and unsupervised techniques.
    • Exploration of active learning techniques (to reduce the manual effort required to create training data).
    A student working on this project would be involved in:

    • Participation in an exciting research project.
    • Exploration and development of statistical, machine learning, and data mining techniques.
    • Object-oriented cross-platform programming (using Python and friends).
    • Open source software development (see http://sourceforge.net/projects/febrl).

    For more information, publications, and further links please visit the project web page at http://datamining.anu.edu.au/linkage.html.


    Probabilistic data set generation

    One of the difficulties in this research area is getting access to real-world data sets containing real names and addresses, as such data is protected by privacy and confidentiality regulations. Developing and testing new and improved data linkage algorithms therefore becomes difficult. We have developed a data set generator (which is part of Febrl). However, this generator is very simple, and various ways of improving it are possible. For more details, please have a look at the recent publication Probabilistic Data Generation for Deduplication and Data Linkage.

    Student research project

    The aim of this honours research project is to further explore how our simple data set generator can be improved. This would include analysing the literature in this area (artificially creating data sets), analysing the existing data generator code, and developing improved techniques that reflect reality more accurately. Issues involved are:

    • Accurately model typographical errors as introduced by (a) scanning and optical character recognition (OCR), (b) keying errors (i.e. errors introduced when people type data into a keyboard), and (c) phonetical errors introduced when data is entered over the phone (and is manually typed in).
    • Model the dependencies between attributes (for example if a person moves, most of her or his address details will change).
    • Model household and family data (i.e. not just generate single records independently, but create sets of records for families, or shared households).
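    As a minimal sketch of the kind of error injection such a generator performs, the following applies one random edit to a field value. The error types and their uniform selection are invented for illustration; realistic modelling of OCR, keying and phonetic errors (the first bullet above) requires error distributions estimated from real data.

```python
# Illustrative only: inject a single random typographical error
# (substitution, deletion, or transposition) into a string.
import random

def introduce_typo(value, rng):
    """Apply one random edit to `value`; very short strings are left alone."""
    if len(value) < 2:
        return value
    pos = rng.randrange(len(value) - 1)
    op = rng.choice(["substitute", "delete", "transpose"])
    if op == "substitute":
        return value[:pos] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[pos + 1:]
    if op == "delete":
        return value[:pos] + value[pos + 1:]
    # transpose: swap the characters at pos and pos + 1
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]

rng = random.Random(42)   # fixed seed so generated data is reproducible
clean = "miller street"
dirty = introduce_typo(clean, rng)
```

    A seeded generator matters in practice: it lets different linkage algorithms be benchmarked on identical corrupted data.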

    A student working on this project would be involved in:

    • Participation in an exciting research project.
    • Exploration and development of statistical, machine learning, and data mining techniques.
    • Object-oriented cross-platform programming (using Python and friends).
    • Open source software development (see http://sourceforge.net/projects/febrl).

    For more information, publications, and further links please visit the project web page at http://datamining.anu.edu.au/linkage.html.


    Privacy-preserving data linkage

    Background

    When linking data sets, often no common unique entity identifiers (keys) are available in all data sources, and so the linkage needs to be performed using the available identifying attributes, like names and addresses. Data confidentiality often limits or even prohibits successful data linkage, as either no consent can be gained (for example in biomedical studies) or the data holders are not willing to openly provide their data.

    • A medical research project, for example, may need to determine which individuals, whose identities are recorded in a population-based register of people suffering from hepatitis C infection, also appear in a separate population-based register of cases of liver cancer. Clearly such linkage between these two databases involves the invasion of privacy of the individuals whose details are stored in these databases. It is usually infeasible to obtain consent from each individual identified in each of the register databases; instead, one or more ethics committees or institutional review boards provide consent for the linkage of the two databases on behalf of the individuals involved, after weighing the public good which is expected to accrue from the research against the unavoidable invasion of individual privacy which the research involves.

    • Similarly, consider two research databases of genetic DNA sequences, each of which is commercially valuable in its own right. The owners or custodians of these two databases may wish to determine whether their databases contain any sequences in common before deciding to collaborate and share their data, without having to first reveal the contents of their databases to the other party, or to a third party.
    In both cases, comparisons need to be made between sequences of symbols - alphanumeric characters like names and addresses in the first example, nucleic bases in the second - in order to measure their similarity, without revealing what those sequences are.
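    One very simple way to compare values without revealing them is for both parties to exchange keyed HMAC digests of normalised identifiers. The sketch below illustrates only this idea: it supports exact matching alone (a single typo defeats it) and is much weaker than the blindfolded linkage protocols studied in this project; the shared key and the data are invented.

```python
# Illustrative blind exact matching: each party encodes its values with
# a keyed HMAC, so the intersection can be computed on codes rather than
# on the sensitive names themselves.
import hmac
import hashlib

SECRET = b"shared-secret-key"  # assumed to be agreed privately by the parties

def blind(value):
    """Normalise a value, then encode it so only equal values collide."""
    norm = " ".join(value.lower().split())
    return hmac.new(SECRET, norm.encode(), hashlib.sha256).hexdigest()

db_a = {"peter smith", "mary jones"}
db_b = {"Peter  Smith", "John Citizen"}

coded_a = {blind(v) for v in db_a}
coded_b = {blind(v) for v in db_b}
common = coded_a & coded_b   # intersection computed without exchanging names
```

    The keyed construction (rather than a plain hash) matters: without the key, an adversary could enumerate likely names and hash them to reverse the encoding.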

    The ANU Data Mining Group is currently working in collaboration with the NSW Department of Health, Centre for Epidemiology and Research on the development of blind data linkage methods (i.e. methods where data linkage can be performed between different parties without the disclosure of any personal or valuable information). To achieve this goal, we are using various techniques from cryptology and related areas. For more details please have a look at the recent publication Some methods for blindfolded record linkage.

    Student research project

    The aim of this honours research project is to further develop these blind data linkage techniques. So far we have implemented simple proof-of-concept programs written in Python. Performance issues with large (real-world) databases need to be addressed, as well as distributed services and Internet connectivity. One aim is to integrate blind data linkage into our open source data linkage system Febrl.

    A student working on this project would be involved in:

    • Participation in an exciting research project in collaboration with an industry partner.
    • Exploration and development of novel privacy preserving data linkage methods using algorithms and techniques from cryptography, machine learning, statistics, indexing, data mining, and distributed and remote computing.
    • Object-oriented cross-platform programming (using Python and friends).
    • Open source software development (see http://sourceforge.net/projects/febrl).

    For more information, publications, and further links please visit the project web page at http://datamining.anu.edu.au/linkage.html.



    Applying Dynamical Bayesian Networks to Digital Forensics Queries

    Supervisors: Professor Terry Caelli, VISTA Program, NICTA and RSISE, ANU
    Dr Nianjun Liu, VISTA Program, NICTA and RSISE, ANU
    Email: Terry.Caelli@anu.edu.au and Nianjun.Liu@nicta.com.au

    Digital forensics undertakes the post-mortem reconstruction of the causal sequence of events arising from an intrusion perpetrated by one or more external agents, or as a result of unauthorised activities generated by authorised users, in one or more digital systems. The field of digital forensics covers a broad set of applications, uses a variety of evidence and is supported by a number of techniques. Application areas include forensic accounting, law enforcement, commodity flow analysis and threat analysis. Forensic investigations often focus on unusual and interesting events that may not have arisen previously. A major objective of a digital investigation is to extract these interesting pieces of evidence and to identify the causal relationships between them.

    This project aims at extending an existing Dynamical Bayesian Network model developed for digital forensics by investigating a number of possible topics. The model developed uses a Bayesian network and hidden Markov model network structure to (i) estimate typical digital crime scenario models from data and (ii) given such models, infer the most likely criminal act given current observations and past criminal acts.

    First, the addition of multiple forward linkages in the BN+HMM network structure, to allow direct linkages between BN nodes at consecutive time intervals. Second, the design of an SQL-based query tool to explore the activities of criminals and their interactions, explain what happened in the past, and predict what will happen in the near future. Finally, the application of a graphical model to data mining of relational digital forensic databases, including construction of a relational pattern structural database for known types of digital crime portfolios and their associated forensic Bayesian network models.
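    In its simplest HMM form, inferring "the most likely criminal act given current observations and past criminal acts" is a Viterbi decoding. The toy decoder below illustrates this; the states, observations and probabilities are invented purely for illustration and bear no relation to the actual models in the project.

```python
# Toy Viterbi decoding for a plain HMM: find the most likely hidden
# state sequence given a sequence of observations.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of any path ending in s at time t, that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-2][prev][1] + [s])
                for prev in states)
            V[-1][s] = (prob, path)
    return max(V[-1].values())

states = ["normal", "intrusion"]
start_p = {"normal": 0.9, "intrusion": 0.1}
trans_p = {"normal":    {"normal": 0.8, "intrusion": 0.2},
           "intrusion": {"normal": 0.3, "intrusion": 0.7}}
emit_p = {"normal":    {"login": 0.7, "file_delete": 0.3},
          "intrusion": {"login": 0.2, "file_delete": 0.8}}

prob, path = viterbi(["login", "file_delete", "file_delete"],
                     states, start_p, trans_p, emit_p)
# path is ["normal", "intrusion", "intrusion"]
```

    The project's BN+HMM structure generalises this: the hidden layer is a full Bayesian network per time slice rather than a single state variable.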

    Applicants must have a major in information technology, computer science, or electrical engineering, preferably with a strong mathematics background and an understanding of probability and statistics, with experience in data analysis and data mining, pattern recognition and machine learning, and excellent programming abilities (MATLAB, C/C++ and Java).

    Contact Dr Nianjun Liu (nianjun.liu@nicta.com.au)


    Intelligent Environmental Query on Spatial Data

    Supervisors: Dr Nianjun Liu, VISTA Program, NICTA and RSISE, ANU
    Professor Terry Caelli, VISTA Program, NICTA and RSISE, ANU
    Email: Nianjun.Liu@nicta.com.au and Terry.Caelli@anu.edu.au

    Analysis of spatial information in natural resources management is crucial to support decision-making processes. However, with the advent of various technologies for acquiring the data, analysis of multiple spatial data sets has become very challenging. These technologies produce data of differing accuracy and resolution. Beyond this multiple representation of spatial data, evidence about an area can come from different times and different observers' views, which makes combining that evidence quite complicated. Combining spatial data or evidence is not simply a matter of combining evidence from different technologies; it also involves combining multi-criteria evidence.

    The Australian Bureau of Rural Sciences (BRS) has developed a system known as the multi-criteria analysis shell for spatial decision support (MCAS-S). The project is to incorporate an option for the use of (Dynamical) Bayesian network approaches to model multiple types of evidence for intelligent environmental query and decision support.
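    As a toy illustration of the multi-criteria combination that a shell like MCAS-S performs, the sketch below combines evidence layers over a small grid as a weighted sum. The layers, weights and grid are invented; a real system must also cope with the differing accuracies and resolutions described above, and the project itself concerns replacing such fixed weights with learned Bayesian network models.

```python
# Toy multi-criteria combination: a weighted sum of per-cell evidence
# values across several spatial layers (values assumed normalised to [0, 1]).
def combine(layers, weights):
    rows, cols = len(layers[0]), len(layers[0][0])
    return [[sum(w * layer[r][c] for layer, w in zip(layers, weights))
             for c in range(cols)] for r in range(rows)]

# Two invented 2x2 evidence layers for the same area.
vegetation = [[0.2, 0.8], [0.5, 0.1]]
slope      = [[0.9, 0.4], [0.3, 0.6]]

score = combine([vegetation, slope], [0.7, 0.3])
```

    The weakness of fixed weights is exactly what motivates the Bayesian network approach: the weights cannot express uncertainty in the evidence or dependencies between criteria.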

    The project involves various techniques: image processing to preprocess the spatial GIS data, machine learning, pattern recognition and probability theory. Applicants must have a major in information technology, computer science, or electrical engineering, preferably with a strong mathematics background and an understanding of probability and statistics, with experience in data analysis and data mining, pattern recognition and machine learning, and excellent programming abilities (MATLAB, C/C++ and Java).

    Contact Dr Nianjun Liu (nianjun.liu@nicta.com.au)


    Intelligent Land Planning on Relational Spatial Data

    Supervisors: Dr Nianjun Liu, VISTA Program, NICTA and RSISE, ANU
    Professor Terry Caelli, VISTA Program, NICTA and RSISE, ANU
    Email: Nianjun.Liu@nicta.com.au and Terry.Caelli@anu.edu.au

    Multi-criteria approaches to the analysis of complex issues in environment decision systems have found wide application across business, government and communities around the world. Such approaches may be readily applied in the context of land planning, which is a prerequisite to the development of a city, town or suburb. Generally, planners collect a range of information about an area, including information about natural resources, topography, demographics, political issues, economic characteristics and proximity to neighbouring settlements and services, and combine this information to make planning decisions. Computer-aided multi-criteria decision support tools allow for the measurement and analysis of alternatives or options, involving a variety of both qualitative and quantitative dimensions.

    The project involves collaboration with the ACT Planning and Land Authority (ACTPLA) to present a sample demonstration of an Intelligent Land Planning tool using Bayesian Networks. Specifically, it will aim to develop a tool for the selection of optimal sites for community services within existing areas of settlement in the ACT, including recreational facilities, schools, childcare centres, aged care facilities and community centres. The factors of interest include existing settlement patterns, demographics (current and anticipated), future development, available land resources, existing services and community need.

    Bayesian networks are designed for settings where there is uncertainty about evidence and about how it should be combined in decision making. The proposed approach is a supervised strategy whereby experts provide known thematic layers of the land cover GIS spatial database as examples of successful decisions. The Bayesian networks are then trained to optimise the prediction of such decisions, with the aim of applying the optimal decision model to new situations or scenarios.

    After exploring the ACT spatial database and other data sources, the scholar will identify the relevant decision factors with the aid of ACTPLA experts and will then build the Bayesian network model. Through iterative data mining on the relational database, the model will be continuously refined and its structure and parameters adjusted accordingly. Finally, the scholar will create the new tool and test it in a real-world context within ACTPLA.

    The project involves various techniques: spatial relational databases, Structured Query Language (SQL), machine learning, pattern recognition and probability theory. Applicants must have a major in information technology, computer science, or electrical engineering, preferably with a strong mathematics background and an understanding of probability and statistics, with experience in spatial databases and data mining, pattern recognition and machine learning, and excellent programming abilities (MATLAB, C/C++ and Java).

    Contact Dr Nianjun Liu (nianjun.liu@nicta.com.au)



    Projects in Artificial Intelligence (diagnostic reasoning, planning, scheduling, search, optimisation, game playing and learning)



    Instance Based Methods

    Supervisors: Dr Peter Baumgartner, Prof John Slaney
    E-Mail: Peter.Baumgartner AT nicta.com.au

    Formal logic provides a mathematical foundation for many areas of computer science. Logical languages are used as specification language within, e.g., program development and verification, hardware design and verification, relational databases, and many subfields of Artificial Intelligence. Automated Deduction is concerned with the design and implementation of algorithms based on logical deduction for solving problems in these areas. Instance Based Methods are a relatively new family of methods for Automated Deduction; they share the principle of carrying out proof search by maintaining a set of instances of input formulas and analyzing it for satisfiability until completion.

    We offer several research-oriented honours projects centred around Instance Based Methods, in particular involving experimentation with, and extension of, our own Darwin system (http://goedel.cs.uiowa.edu/Darwin/).


    A circuit-based algorithm for solving quantified Boolean formulae

    Supervisor: Dr Jussi Rintanen
    E-Mail: rintanen-AT-informatik.uni-freiburg.de

    Dr Rintanen will join NICTA's Knowledge Representation and Reasoning Program (KRR) in January 2006. He is currently with the University of Freiburg, Germany. The local contact for this project before Dr Rintanen's arrival is Dr Sylvie Thiébaux.

    Background

    Following the success of satisfiability algorithms in solving a wide range of practically important problems in computer aided verification, artificial intelligence and other areas, the idea of using algorithms for quantified Boolean formulae (QBF) for similar but more general problems has become increasingly attractive. One of the problems present in the current QBF algorithms is that they are defined only for QBF that have their matrices in conjunctive normal form (CNF), and the most natural representations of many problems in QBF involve arbitrary Boolean formulae or circuits and do not have simple translations to CNF.

    Honours project

    The goal of the research project is to develop a proof system for QBFs represented as arbitrary Boolean circuits, not restricted to CNF, and to develop an efficient QBF algorithm based on the proof system. In comparison to circuit-based proof systems for classical propositional logic, this involves identifying proof rules for quantifiers and universally and existentially quantified variables, as well as rules for efficient scoping of variables. The QBF algorithm will be compared to CNF-based QBF algorithms used in connection with translations of circuit-QBF into QBF in CNF.
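    To make the non-CNF representation concrete, here is a naive, exponential-time QBF evaluator over arbitrary Boolean formula trees. The tuple encoding is invented for illustration; a real solver would of course apply a proof system rather than brute-force expansion of every quantifier.

```python
# Naive QBF evaluation over arbitrary formula trees (no CNF restriction).
# Formulas are nested tuples: ('var', name), ('not', f), ('and', f, g),
# ('or', f, g). The prefix is a list of ('forall' | 'exists', name) pairs.
def eval_formula(f, env):
    op = f[0]
    if op == "var":
        return env[f[1]]
    if op == "not":
        return not eval_formula(f[1], env)
    if op == "and":
        return eval_formula(f[1], env) and eval_formula(f[2], env)
    return eval_formula(f[1], env) or eval_formula(f[2], env)

def eval_qbf(prefix, matrix, env=None):
    env = dict(env or {})
    if not prefix:
        return eval_formula(matrix, env)
    quant, var = prefix[0]
    results = (eval_qbf(prefix[1:], matrix, {**env, var: val})
               for val in (False, True))
    return all(results) if quant == "forall" else any(results)

# forall x exists y. (x or y) and (not x or not y)  -- choosing y = not x works
matrix = ("and",
          ("or", ("var", "x"), ("var", "y")),
          ("or", ("not", ("var", "x")), ("not", ("var", "y"))))
result = eval_qbf([("forall", "x"), ("exists", "y")], matrix)
# result is True; swapping the quantifier order makes it False
```

    Note how the formula stays a circuit-like tree throughout; forcing it into CNF first is exactly the translation step the project aims to avoid.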

    Compact representation of schematic operators in planning as satisfiability

    Supervisor: Dr Jussi Rintanen
    E-Mail: rintanen-AT-informatik.uni-freiburg.de

    Dr Rintanen will join NICTA's Knowledge Representation and Reasoning Program (KRR) in January 2006. He is currently with the University of Freiburg, Germany. The local contact for this project before Dr Rintanen's arrival is Dr Sylvie Thiébaux.

    Background

    Translation into the classical propositional logic has been proposed as a powerful and general solution technique for classical AI planning. The actions of many planning problems are best represented in terms of schemata that are instantiated with given objects to obtain the actual actions. For example, a generic schema for the movement of vehicles between locations is IN(V,X) => -IN(V,X), IN(V,Y), in which V can be instantiated with any vehicle and X and Y with any locations. Kautz and Selman [1996] and Ernst et al. [IJCAI 1997] have considered compact representations of schematic actions in the propositional logic. However, these works are restricted to sequential plans, and there is considerable potential to improve the efficiency of planning further by considering a notion of parallel plans which allows the execution of several actions simultaneously.
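    The cost of naive grounding can be sketched directly from the vehicle schema above: instantiating IN(V,X) => -IN(V,X), IN(V,Y) over every vehicle and ordered pair of locations yields a ground action set quadratic in the number of locations, which is exactly what compact schematic encodings try to avoid. The representation below is invented for illustration.

```python
# Naive grounding of the move schema IN(V,X) => -IN(V,X), IN(V,Y):
# one ground action per vehicle and ordered pair of distinct locations.
from itertools import product

vehicles = ["truck1", "truck2"]
locations = ["depot", "cityA", "cityB"]

def ground_move_actions(vehicles, locations):
    actions = []
    for v, x, y in product(vehicles, locations, locations):
        if x == y:
            continue
        actions.append({
            "name": f"move({v},{x},{y})",
            "pre":  [f"IN({v},{x})"],   # precondition
            "del":  [f"IN({v},{x})"],   # deleted effect
            "add":  [f"IN({v},{y})"],   # added effect
        })
    return actions

actions = ground_move_actions(vehicles, locations)
# 2 vehicles * (3 * 2) ordered location pairs = 12 ground actions
```

    With thousands of objects this blow-up dominates the encoding size, so recognising the regularity in the schema, rather than enumerating its instances, is the key to a compact propositional representation.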

    Honours project

    The goal of the research project is to develop compact representations of schematic actions for parallel plans in the propositional logic. If the regularities in the actions described by an action schema are recognized, a more compact representation of the planning problem is possible, potentially leading to much more efficient planning. The work includes extending the ideas presented by Kautz and Selman and Ernst et al. to parallel plans as well as developing new techniques for compact action representation, and then implementing these techniques as a part of a planning system.

    Supercom

    Supervisor: Dr Sylvie Thiébaux, Dr Anbulagan, Dr Olivier Buffet, Dr Jinbo Huang, Dr Yannick Pencolé
    E-Mail: sylvie.thiebaux-AT-anu.edu.au, anbulagan-AT-nicta.edu.au, olivier.buffet-AT-nicta.edu.au, jinbo.huang-AT-nicta.edu.au, yannick.pencole-AT-anu.edu.au

    Background

    Composite systems feature simple components organised into a highly reconfigurable and modular architecture, due to which they can appropriately assemble to achieve complex operations. Examples include web and grid services, energy distribution systems, water management systems, chemical plant control systems, telecommunication and computer networks, to name but a few.

    In the "Supercom" project, we are interested in building Artificial Intelligence tools to supervise such composite systems. These will need to give the system the ability to self-monitor (diagnose) and self-organise (assemble, reconfigure) to achieve optimal performance.

    In this framework, various projects are available on topics such as decentralised diagnosis [1], decentralised planning [2] or diagnosis using satisfiability and knowledge compilation methods [3]. The main supervisor will depend on the project, but interactions with the other researchers will arise naturally. A strong background in computer science, software engineering, or mathematics, and an interest in artificial intelligence are required, as well as good programming skills.

    References:

    [1] Yannick Pencolé, Marie-Odile Cordier, A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks, Artificial Intelligence, Volume 164, Number 1-2, May 2005.
    [W: http://www.informatik.uni-trier.de/%7Eley/db/journals/ai/ai164.html#PencoleC05]

    [2] E. Amir and B. Engelhardt, Factored Planning, in 18th Intl' Joint Conference on Artificial Intelligence (IJCAI'03), 2003.
    [W: http://www.cs.uiuc.edu/~eyal/papers/factored-plans-ijcai03.pdf]

    [3] Chu Min Li and Anbulagan, Look-Ahead Versus Look-Back for Satisfiability Problems, CP 1997, Linz, Austria.
    [W: http://web.rsise.anu.edu.au/~anbu/papers/CP97Anbulagan.pdf]


    Word Processing Structure Analyser

    Supervisors: Dr Ian Barnes and Dr Peter Raftos

    Background

    An important part of current research being done into the long-term preservation and accessibility of word processor (WP) documents is the conversion of (primarily) Word and OpenOffice files to some structured XML (usually DocBook-XML).

    This work assumes that the WP documents have already been structured in a systematic manner using the WP software's inbuilt style functions. However, most users do not use styles; generally, they are not even aware they exist. And even if every user were to start using them from today, there remains a legacy of unstyled (and therefore unstructured) WP documents.

    Most users rely on the look of document content to deduce structure: reasonably arbitrary choices of font size and weighting. It would be very useful if algorithms could be developed to analyse (or "guess") the structure of a WP document. This should then lead to the development of a working tool to add structure to unstructured WP documents.

    The Work

    Using a series of unstructured trial documents of varying complexity, develop:

    1. Algorithm(s) that can deduce, with a reasonable level of accuracy (80-90%), the structure within the document. Basic structure includes body text and headings from level 1 to 6, as well as associated sections, multi-level lists (both numbered and bulleted), block quotes and definition lists. This may be expanded.

    2. Using the algorithms, a Java/XML-based tool to convert WP documents from Word or OpenOffice formats to structured (that is, styled) OpenOffice documents, suitable for submission to Dr Ian Barnes' Scholars Workbench (SWB) application.
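    As a rough Python sketch of the font-size heuristic (the project itself calls for a Java/XML tool), the function below classifies paragraphs into heading levels purely by relative font size. The input format, a list of (text, font_size, bold) tuples, is invented; real WP documents offer far richer cues (spacing, numbering, bold/italic runs) that a serious analyser would exploit.

```python
# Illustrative heading detection: any font size larger than the body
# size is mapped to a heading level, largest first.
def classify(paragraphs, body_size=None):
    sizes = sorted({p[1] for p in paragraphs}, reverse=True)
    body = body_size or min(sizes)          # assume smallest size is body text
    levels = {s: i + 1 for i, s in enumerate(s for s in sizes if s > body)}
    out = []
    for text, size, bold in paragraphs:
        if size > body:
            out.append((f"heading{levels[size]}", text))
        elif bold:
            out.append(("heading-candidate", text))  # bold body text is ambiguous
        else:
            out.append(("body", text))
    return out

doc = [("Introduction",    18, True),
       ("Some body text.", 11, False),
       ("Methods",         14, True),
       ("More body text.", 11, False)]
structured = classify(doc)
```

    The "heading-candidate" case hints at why the target accuracy is 80-90% rather than 100%: many documents mark headings only with bold body-sized text, which this cue alone cannot resolve.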

    There may be a scholarship attached to this project.

    RFID implementation and privacy

    Supervisors: Dr Michael Sheng (CSIRO), Mr Tom Worthington (DCS, ANU) and A/Prof. Chris Johnson (DCS, ANU)
    E-Mail: Michael.Sheng@csiro.au
    E-Mail: Tom.Worthington@anu.edu.au
    E-Mail: cwj@cs.anu.edu.au

    Background

    Radio Frequency Identification (RFID) technology uses radio-frequency communication to transfer data between tagged objects and readers, creating the possibility of a physically linked world where every object is numbered, identified, catalogued, and tracked. There are many innovative and high-payoff RFID applications, such as tracking doctors and patients in hospitals to improve efficiency, marking airline bags to reduce lost luggage, and tracking people in refugee camps to ease post-disaster relief (e.g., helping people find their missing relatives). Currently, major players in both academia and industry, including MIT, Stanford, SAP, Microsoft, and Sun, are investing heavily in R&D to enable RFID to realise its potential.

    This project centres around the investigation and implementation of RFID technologies. In particular, we will first build an experimental environment using RFID tags and readers and perform RFID data translation and transformation. We will also study RFID standards, including the EPCglobal network. The final stage of the project is the design and implementation of an RFID application for tracking people in refugee camps, based on the open source Sahana system (http://www.sahana.lk/). Subject to the student's interests, there is also a chance to investigate the privacy issues associated with this application.

    This project presents an exciting opportunity for the student to obtain sound and fundamental knowledge of this hot area and to gain research skills and experience. The student will work with researchers from the CSIRO ICT Centre and ANU.

    Skill prerequisites: No prior knowledge of RFID is required, but good programming skills in Java are essential. Knowledge of databases (e.g., Oracle), XML, and Web services is a plus.


    A compiler for program synthesis

    Supervisors: Dr Michael Compton, Dr Mark Cameron (CSIRO), A/Prof Rajeev Gore, Dr Alwen Tiu (RSISE, ANU)
    E-Mail: Mark.Cameron@csiro.au
    E-Mail: Rajeev.Gore@anu.edu.au

    Background

    This project is about writing a compiler for the automatic synthesis of 'programs' from specifications. Given the specifications for services S1,...,Sn and the specification of a goal G, the compiler will generate a 'program' satisfying G by using the services S1,...,Sn. This compiler could be targeted to a number of possible settings. For example, the specifications could describe web services, in which case the goal is the automatic generation of composite web services from other web services. Alternatively, the specifications could describe components of a security system and the goal some larger security property, so that the compiler automatically generates the larger security framework.

    The Information Engineering Lab at CSIRO's ICT Centre already has such a compiler for web services [1], in which a logic mapping language relates resources, such as data and web services, to relevant concepts and properties [2], and the composition compiler automatically translates a high-level declarative composition goal into an executable workflow responsible for orchestrating the data and web-service calls necessary to materialize the specified data product. However, we are currently restricted to composing over services that have stateless, functional and deterministic operations. We would like to extend this composition environment to work with web services that have a stateful interaction style (i.e. the operation/message that the service can perform depends on the current state of the service and the service state machine). A published method for web service composition [3,4,5] does so, by using a modal logic and tableaux theorem proving, but it concentrates on message flow, not data. We would like to integrate the two.

    The project would involve investigating how to integrate both data and state into the specification of services and a composition goal.
The outcome would be a logic for describing the specification of each, and a compiler that can automatically generate a 'program' satisfying the goal by using the provided services. The project would suit a student with an interest in logic and specification.

    References

    [1] Cameron, M. A., Taylor, K. L. and Baxter, R. (2004), Web-Service Composition and Record Linking, VLDB Workshop on Information Integration on the Web (IIWeb-2004), Toronto, Canada, August 2004. URL: http://cips.eas.asu.edu/iiwebfinalproceedings/31.pdf
    [2] Cameron, M. A. and Taylor, K. (2005), First Order Patterns for Information Integration, Proceedings of the 5th International Conference on Web Engineering (ICWE2005), 25-29 July, Sydney, Australia. URL: http://dx.doi.org/10.1007/11531371_25
    [3] Daniela Berardi, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini and Massimo Mecella (2004), Synthesis of Underspecified Composite e-Services based on Automated Reasoning, in Proc. of the 2nd Int. Conf. on Service Oriented Computing (ICSOC 2004). URL: http://www.inf.unibz.it/~calvanese/papers-html/ICSOC-2004.html
    [4] Daniela Berardi, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini and Massimo Mecella (2004), ESC: A Tool for Automatic Composition of e-Services based on Logics of Programs, in Proc. of the 5th VLDB Workshop on Technologies for E-Services (TES 2004). URL: http://www.inf.unibz.it/~calvanese/papers-html/TES-2004.html
    [5] Daniela Berardi, Diego Calvanese, Giuseppe De Giacomo, Rick Hull and Massimo Mecella (2005), Automatic Composition of Transition-based Semantic Web Services with Messaging, in Proc. of the 31st Int. Conf. on Very Large Data Bases (VLDB 2005). URL: http://www.inf.unibz.it/~calvanese/papers-html/VLDB-2005.html
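    As a toy illustration of what "generate a program satisfying G from services S1,...,Sn" means (the real compiler uses logic and theorem proving; this sketch uses only simple forward chaining over hypothetical input/output specifications, and the service names are invented):

```python
def compose(services, have, goal):
    """Forward-chaining composition sketch. Each service is a triple
    (name, inputs, outputs); `have` names the initially available data.
    Returns an ordered list of service names whose execution derives
    `goal`, or None if no composition exists."""
    have = set(have)
    plan = []
    progress = True
    while goal not in have and progress:
        progress = False
        for name, ins, outs in services:
            # apply a service once all its inputs are available and
            # it would produce something new
            if set(ins) <= have and not set(outs) <= have:
                plan.append(name)
                have |= set(outs)
                progress = True
    return plan if goal in have else None
```

    For instance, with a hypothetical "geocode" service (address -> latlon) and "weather" service (latlon -> forecast), the goal "forecast" from an "address" yields the plan ["geocode", "weather"]. A logic-based synthesiser does much more (it handles state, choice and correctness proofs), but the goal-directed flavour is the same.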

    Tableaux reasoning for web service composition

    Supervisors: Dr Mark Cameron (CSIRO) A/Prof Rajeev Gore (RSISE, ANU)
    E-Mail: Mark.Cameron@csiro.au
    E-Mail: Rajeev.Gore@anu.edu.au

    Background

    The Information Engineering Lab has a logic-based service composition framework that is being applied across a number of domains, including health data integration (Cameron et al., http://cips.eas.asu.edu/iiwebfinalproceedings/31.pdf) and environmental management (http://www.iemss.org/iemss2002/proceedings/pdf/volume%20uno/304_cameron.pdf). A published method for web service composition (Berardi et al. 2004, http://www.inf.unibz.it/~calvanese/papers-html/ICSOC-2004.html and http://www.inf.unibz.it/~calvanese/papers-html/TES-2004.html; Berardi et al. 2005, http://www.inf.unibz.it/~calvanese/papers-html/VLDB-2005.html) relies on a translation of finite-state machine representations of a composition goal and stateful web services into a modal/description logic, PDL. Tableaux theorem proving can then be applied to find a web service composition that satisfies the goal. This project would aim to replicate and implement that basic approach, while considering alternative logics to improve the scope of the compositions that may be achieved.


    Eye-gaze tracking as an interface for virtual environments

    Supervisors: Chris Gunn (CSIRO) Henry Gardner (DCS ANU)
    E-Mail: Chris.Gunn@csiro.au
    Henry.Gardner@anu.edu.au

    Background

    This project will seek to characterise the effectiveness of eye-gaze tracking as a human-computer interface for virtual environments. The concept is that participants should be able to interact with projected images using a combination of keyboard controls and eye gaze: by focussing their visual attention on icons or images on the screen, users should be able to select those icons or activate other interactions. Two application areas of interest are the Mind Attention Interface for the Wedge virtual reality theatre and a something-or-other system in which participants stand in front of a half-spherical shell screen. In the Mind Attention Interface, eye-gaze tracking equipment will be integrated with electroencephalogram (EEG) equipment, which measures electrical signals from the brain of a participant. Correlating eye-gaze measurements of attention with EEG data would be a possible extension of this project.


    Performance Instrumentation for the SCASH Distributed Shared Memory System

    Supervisor: Dr Peter Strazdins

    A cluster computer, such as the ANU Beowulf, is a parallel computer using a (typically commercial-off-the-shelf) switch-based network to communicate between its processors. Clusters have proved a highly cost-effective high-performance computing model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks.

    Exploiting the price-performance `sweetspot' even more finely, clusters whose nodes are low-end (2 or 4 CPU) SMPs have become popular. The Beowulf has 2-way SMPs; the Sunnyvale cluster at DCS has 4-way SMPs.

    It is possible to model the execution time of a parallel application using simple algebraic equations. For example, the time to send a message of n words can be modelled by:

      t(n) = a + b * n
    where a represents the startup cost (latency) of a message, and b represents the transmission cost per word. These parameters are characteristic of the cluster's communication network. The execution times of more complex communication patterns can be expressed in terms of these parameters; a recent project has already studied this on the Beowulf (assuming only one process per node; see below).
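    In practice the parameters a and b are estimated from measured ping-pong timings. A minimal sketch of that step (in Python, assuming a list of (message size, measured time) samples; not part of the project itself) is an ordinary least-squares fit of t(n) = a + b*n:

```python
def fit_latency_bandwidth(samples):
    """Least-squares fit of t(n) = a + b*n from (n, time) samples:
    a estimates the per-message latency, b the per-word transmission
    cost."""
    m = len(samples)
    sx = sum(n for n, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(n * n for n, _ in samples)
    sxy = sum(n * t for n, t in samples)
    # standard simple-linear-regression closed form
    b = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    a = (sy - b * sx) / m
    return a, b
```

    Fitting over many message sizes, rather than reading off two points, smooths out timing noise on a real network.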

    Similarly, the execution time of a loop of n iterations can be modelled by an equation of a similar form. In this way, one can model the execution time of computations such as parallel LU decomposition, as first described in LAPACK Working Note 80.

    The above techniques worked well for clusters and massively parallel computers with one CPU per node. Parallel applications on SMP clusters, however, are typically run with one process allocated to each CPU, so that there are multiple processes per cluster node. Here, processes on the same node communicate via a shared-memory transport, at a considerably faster rate than messages over the network. On the other hand, processes on node A may simultaneously try to send messages to processes on node B; in this case message speed is reduced by contention for the single link between A and B.

    Thus, modelling the execution time of a communication operation such as a broadcast amongst a subset of these processes is non-trivial, as it may have to take both of these effects into account.

    This project will investigate how to model execution times on SMP clusters, building up from communication patterns to `regular' applications, such as dense linear system solution. Probabilistic techniques may be required to model the effects of contention. The ANU Beowulf and the SC will form suitable experimental platforms. The project has the potential for publishable work.

    References

    See the links above.

    Extended Threads Emulation in an SMP Computer Simulator

    Supervisor: Dr Peter Strazdins and other members of the Sparc-Sulima team.

    Simulation is playing an increasingly important role in the design and evaluation of current and future high performance computer systems. Sparc-Sulima is an UltraSPARC SMP full-machine simulator currently being developed by the CC-NUMA Project (before that by the ANU-Fujitsu CAP Project). Sparc-Sulima simulates an UltraSPARC processor by interpreting each instruction.

    Currently, Sparc-Sulima can simulate an unmodified Solaris (UltraSPARC) binary (executable program). When that program executes a system call trap instruction, the corresponding operating system effects are `emulated' (the simulated computer state is updated directly, rather than via simulating a lengthy and complex sequence of kernel-level instructions), and simulated execution then resumes. This mechanism is handled by the SolEmn emulation layer in Sparc-Sulima. Threaded programs (e.g. using the POSIX pthreads library) ultimately access thread-related operating system services via the Solaris _lwp (lightweight process) trap instructions. The limitation of the current implementation is that, at any time, there can be only one active thread per CPU. This limitation is shared by all emulation-based computer simulators reported in the literature.

    The aim of this project is to extend the emulation of the _lwp trap instructions to remove this limitation. It will require implementing thread switching, which can occur through time-outs, and also through the _lwp_suspend and yield system calls. The current thread context save/restore mechanisms will also have to be extended. If successful, the work should be publishable.

    An extension of this project, more implementation oriented, would be to simulate, rather than emulate, the _lwp traps. This would involve writing a mini-kernel that implements lightweight threads. Now, when Sparc-Sulima executes a threaded program and encounters an _lwp trap, it will start executing the kernel, just like on a real machine. Note that the same underlying algorithms of the emulation mode can be used; however, some consideration must now be given to the `locking' of shared data structures. The advantage of this approach is that kernel-level (notably memory system-related) overheads can now be recorded by the simulator. Thus, the simulator will be able to predict the threaded program's overheads more accurately.
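    The bookkeeping the extended _lwp emulation needs can be pictured with a toy round-robin scheduler (a hypothetical sketch in Python; the real emulation would manipulate simulated UltraSPARC register state, not strings):

```python
class ThreadEmulator:
    """Toy round-robin scheduler illustrating the state the extended
    _lwp emulation would keep: per-thread saved contexts, a ready
    queue, and switches triggered by yield, suspend, or a time-out."""

    def __init__(self):
        self.contexts = {}   # lwp id -> saved context (opaque here)
        self.ready = []      # runnable lwp ids, round-robin order
        self.current = None

    def lwp_create(self, lwp_id, context):
        self.contexts[lwp_id] = context
        self.ready.append(lwp_id)

    def switch(self, saved_context):
        """Save the current thread's context and dispatch the next
        runnable one. Pass None for the initial dispatch."""
        if self.current is not None:
            self.contexts[self.current] = saved_context
            self.ready.append(self.current)
        self.current = self.ready.pop(0)
        return self.contexts[self.current]
```

    Each _lwp_suspend, yield, or timer expiry maps onto a `switch` call; the harder parts of the real project are saving/restoring the full register window state and doing this per simulated CPU.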

    This project is related to the project Using Threading Techniques to speed up SMP Computer Simulation. It is part of the CC-NUMA Project.

    References

    • Slides from a background talk on Sparc-Sulima.
    • Pthreads Programming (O'Reilly), and other books on POSIX threads; Solaris on-line documentation on Solaris threads, including man -s 2 intro.
    • From the Sparc-Sulima home page: a survey paper on machine simulation (citing many more references) and papers on the implementation of Sparc-Sulima and SolEmn. There are manuals and source code too.


    Parallelizing Applications on a Heterogeneous Cluster

    Supervisors: Dr Peter Strazdins, Dr Alistair Rendell, and Mr Richard Alexander

    A Beowulf-style cluster computer is a parallel computer using a commercial-off-the-shelf switch-based network to communicate between the processors. The ANU Beowulf cluster Bunyip is such a cluster, based on Fast (100Mb) Ethernet switches. Clusters have proved a highly cost-effective high-performance computing model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks.

    A typical cluster such as Bunyip tends to evolve as new processors and networks (and the money to pay for them!) become available. Thus clusters tend to become heterogeneous. Furthermore, for a particular application there tends to be a particular cluster configuration that is optimal. In the case of a (high-performance) computing service centre, designed to serve many such applications, a cluster may be designed to be heterogeneous from the beginning.

    The efficient parallelization of applications using the Message Passing Interface (MPI) is reasonably well understood for homogeneous clusters. However, for heterogeneous clusters, which can have not only nodes with different processor architectures but also sections of nodes connected by different networks, the situation is far more challenging. In particular, differences in processor and network speeds make it difficult to utilise the faster components effectively.

    In partnership with the local company Alexander Technology, who will install in 2006 a heterogeneous cluster in DCS called Jabberwocky, this project will investigate techniques used so far to parallelize (scientific/engineering) applications for heterogeneous clusters. The project will then make a selection of these applications and evaluate the effectiveness of these techniques on the Jabberwocky. New or variant techniques may then be proposed and evaluated.

    This project is of strategic interest not only to DCS but also to Alexander Technology, which is rapidly expanding its research and development in cluster technology. Scholarships may be available for suitably qualified applicants.

    References

    See the links above.

    Automatic Tuning of the High Performance Linpack Benchmark

    Supervisor: Dr Peter Strazdins and Dr Alistair Rendell and Warren Armstrong (PhD student)

    Supercomputers today are large-scale parallel processors, and the Top 500 list is the most definitive ranking of the world's fastest. While the status of this list is contentious, it is based on the results of a single benchmark - the Linpack benchmark - which in this case solves a (very) large dense linear system. The list is dominated by Beowulf-style cluster computers: parallel computers using a (typically commercial-off-the-shelf) switch-based network to communicate between the processors.

    There is a publicly available open-source code for a highly optimized version of this benchmark called High Performance Linpack (HPL). It is also highly tunable, but the tuning currently must be done by hand. This is problematic, in that it requires knowledge both of the complex algorithms used in HPL (beyond the understanding of most people who wish to run the benchmark) and of the (cluster) supercomputer itself. Furthermore, in the case of clusters, this problem is exacerbated by the fact that the cluster may be non-homogeneous (so the benchmark results may depend on which parts of the cluster are selected), and by the fact that clusters by their nature may be easily reconfigured. Thus, it is highly desirable that this tuning process be automated. The LAPACK For Clusters (LFC) project went partway towards addressing this, with some work on determining, given p processors, the best logical P x Q processor configuration with P*Q <= p. However, automatically choosing the critical blocking parameter and among the plethora of algorithm variants remains to be done.

    This project will investigate automatic tuning methods for the HPL package on cluster computers. The search can be done statically, and can employ efficient empirical methods based on techniques such as the Nelder-Mead simplex search, as is done for the ATLAS package (which itself can be used as a component of HPL). The DCS Beowulf, Sunnyvale and Saratoga clusters will form suitable experimental platforms. The project has the potential for publishable work and for the production of widely used open-source software.

    References

    See the links above, and also:

    • The IEEE Task Force on Cluster Computing home page.
    • The ANU Beowulf home page. There is also a paper on solving dense linear systems on this platform (exposing some problems with HPL).
    • Dongarra, J. et al., Self-Adapting Numerical Software (SANS) Effort, IBM Journal of Research and Development 50(2-3), 2006.
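    The grid-selection sub-problem that LFC automated can be sketched in a few lines (a toy in Python; real HPL tuning must also search the blocking parameter and algorithm variants against measured run times, which this deliberately omits):

```python
def best_grid(p):
    """Enumerate logical P x Q process grids with P*Q <= p, preferring
    configurations that use the most processors and, among those, the
    most nearly square grid -- a common rule of thumb for dense linear
    algebra, used here as a stand-in for a measured cost model."""
    best = None
    for P in range(1, p + 1):
        for Q in range(1, p // P + 1):
            # score: (processors used, squareness); higher is better
            key = (P * Q, -abs(P - Q))
            if best is None or key > best[0]:
                best = (key, (P, Q))
    return best[1]
```

    An empirical tuner would replace the `key` heuristic with actual benchmark timings, and a simplex-style search would avoid enumerating every candidate.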


    Enhancing Communication using Xen in SMP Clusters

    Supervisor: Dr Peter Strazdins

    A Beowulf-style cluster computer is a parallel computer using a commercial-off-the-shelf switch-based network to communicate between the processors. The ANU Beowulf cluster Bunyip is such a cluster, based on Fast (100Mb) Ethernet switches. Clusters have proved a highly cost-effective high-performance computing model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks. Parallel programs for clusters are typically written using the MPI Message Passing Interface library.

    Xen is a technology providing virtualization of operating system services, and is emerging as one of the hottest topics in operating systems R&D. It permits, for example, multiple hosts to run on an SMP (multiple-CPU) machine, binding an independent operating system to each CPU. In the HPC context, virtualization holds many potential advantages, such as allowing easier job migration and permitting per-user OS customization. The latter is not only useful for performance, but may be critical for complex applications to run at all (as these often require a particular (Linux) distribution). However, the concern with virtualization is its potential overhead.

    Linux clusters with SMP (multiple-CPU) nodes and interconnects such as Gigabit Ethernet may be configured with multiple Network Interface Cards (NICs). A question which naturally arises is how to optimally configure the hardware (NICs) and the software which drives them. Alternatives range from the CPUs sharing channel-bonded NICs through to dedicating a NIC to a particular CPU. In the latter case, technologies such as Xen can be utilized. In both cases TCP/IP stack overhead is a concern. Preliminary work in 2006 with colleagues from Alexander Technology on the Jabberwocky cluster revealed that most aggregate bandwidth is achieved using dedicated NICs, with Xen hosts running on each CPU of a node slightly out-performing a single Linux OS over the whole node. This resulted in a paper presented at the Workshop on Xen in High-Performance Cluster Computing in December 2006.

    A problem encountered is that communication between Xen hosts running on the same node is about 5 times slower than communication between processes within a single Linux OS. This makes running real applications over clusters with multiple Xen hosts per node impractical. This project will implement in Xen a bypass mechanism to optimize communication in this case on the latest version of Xen. Time permitting (e.g. for 24-unit projects), it will evaluate application-level performance on multiple-NIC communication configurations, including those utilizing Xen. The project has the potential for publishable work.

    References

    See the links above, and also:

    • The IEEE Task Force on Cluster Computing home page.
    • P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt and A. Warfield, Xen and the Art of Virtualization, in Proceedings of SOSP '03: the Nineteenth ACM Symposium on Operating Systems Principles (2003).
    • Huang, W., Liu, J., Abali, B. and Panda, D., A Case for High Performance Computing with Virtual Machines, in Proceedings of ICS '06: International Conference on Supercomputing, Cairns (2006).


    Energy-Efficient Data Gathering in Wireless Sensor Networks

    Supervisor: Dr Weifa Liang
    E-Mail: wliang@cs.anu.edu.au

    Background

    Wireless sensor networks have received significant attention recently due to their potential applications in both civilian and military domains, including surveillance, fine-grain monitoring of natural habitats, and measuring variations in local salinity levels in riparian environments. The sensors in a sensor network periodically collect sensed data as they monitor their vicinities for a specific interest.

    The sensed data can be transmitted directly to the base station where client queries are posed; the base station then processes the collected data centrally and responds to users' queries. However, because each sensor is equipped with an energy-limited battery, energy efficiency, rather than the response time for each query, is of paramount importance in sensor networks, and the energy consumption of such centralized processing is very high. Instead, we adopt energy-efficient in-network processing: data gathering is implemented through a tree rooted at the base station, and the data at each sensor is filtered or combined locally before being transmitted to the sensor's parent.

    Unlike most existing work, which assumes that each sensor transmits a fixed-length message, we consider a data-gathering model in which the length of the message that each sensor transmits to its parent is proportional to the sum of the message lengths of its children and itself. Thus, the total energy consumption at a node is proportional to the length of the transmitted message and the distance between the node and its parent in the tree. Specifically, on one hand, we aim to find an energy-efficient spanning tree rooted at the sink node such that the network lifetime is maximized. On the other hand, we prefer an approximate result to the exact result of traditional data models for a given query, since communications among the sensors are failure-prone. Under this circumstance, how close the approximate result is to the exact result becomes the essential issue.

    In summary, this project will pursue energy efficiency on the basis of database modelling in sensor network environments. A student working on this project would be involved in:

    • participating in the research project;
    • studying the complexity of the problem when the network lifetime is taken as the sole metric;
    • proposing heuristic algorithms for finding a data gathering tree and providing approximate aggregate results for a given aggregate query;
    • conducting experiments by simulation to validate the efficiency of the proposed algorithms.
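    The lifetime objective described above can be made concrete with a small sketch (in Python, invented data structures: a parent map for the tree and per-link distances; unit message length per sensor, a simplification of the model):

```python
def lifetime(parents, dists, battery=100.0):
    """Given a gathering tree as a node -> parent map (the sink has
    parent None), per-node distances to the parent, and one unit of
    data per sensor, accumulate each node's transmitted length (its
    own data plus all descendants') bottom-up, then return the network
    lifetime = min over transmitting nodes of
    battery / (transmitted length * link distance)."""
    length = {v: 1 for v in parents}

    def depth(v):
        d = 0
        while parents[v] is not None:
            v = parents[v]
            d += 1
        return d

    # deepest nodes first, so children are folded in before parents
    for v in sorted(parents, key=depth, reverse=True):
        p = parents[v]
        if p is not None:
            length[p] += length[v]
    return min(battery / (length[v] * dists[v])
               for v in parents if parents[v] is not None)
```

    Comparing this value across candidate spanning trees is exactly the kind of evaluation a heuristic tree-construction algorithm (and the simulation experiments) would need.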


    Energy-Efficient Aggregate Node Placement for Data Gathering in Wireless Sensor Networks

    Supervisor: Dr Weifa Liang
    E-Mail: wliang@cs.anu.edu.au

    Recent advances in microelectronic technology have made it possible to construct compact and inexpensive wireless sensors. Networks formed by such sensors, termed wireless sensor networks, have been receiving significant attention due to their potential applications in environmental monitoring, surveillance, military operations and other domains. While these applications have revealed the tremendous potential of sensor networks for capturing important environmental phenomena, they have also exposed certain limitations. One of the major limitations of sensor nodes is that they are powered by low-power batteries, which limit the network lifetime and impact the quality of the network. Energy conservation in sensor network operations is thus of paramount importance. As most currently deployed sensor networks are used for monitoring and surveillance purposes, the sensed data generated by the network must be available to users at the base station; data gathering, which extracts and collects sensed data from the sensors, is thus a fundamental operation in sensor networks. Energy conservation in data gathering prolongs the network lifetime, and the study of energy-efficient data gathering poses great challenges due to the unique characteristics and physical constraints imposed on sensor networks.

    In this project we study the aggregate node placement problem for data gathering, which aims to place a few powerful aggregate nodes in a dense sensor network so that the network performance can be significantly improved. Assume that a heterogeneous sensor network consists of two types of nodes: cheap sensor nodes that have a fixed, identical transmission range, and expensive aggregate nodes that have various transmission ranges. Aggregate nodes have more energy, higher data transmission rates, and better processing and storage capabilities than sensor nodes. The main task of an aggregate node is to process and relay data for sensor nodes and the other aggregate nodes. Due to the expensive cost associated with aggregate nodes, their number in a sensor network is very limited; thus, they need to be placed in the network carefully in order to increase the network connectivity, reduce the data gathering delay, and prolong the network lifetime. Specifically, the aggregate node placement problem for data gathering is as follows. Given a sensor network consisting of n dense sensor nodes randomly deployed in a region of interest and a small number K of expensive aggregate nodes (K << n), the problem is to place the K aggregate nodes in the region so that the network lifetime in answering data gathering queries is maximised.
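    One simple baseline for "place K powerful nodes carefully" is the classic greedy k-center heuristic, sketched below in Python (this is only an illustrative starting point; it bounds sensors' distance to their nearest aggregate node rather than directly maximising lifetime, which the project's algorithms would need to do):

```python
import math

def place_aggregates(sensors, K):
    """Greedy k-center sketch: start from an arbitrary sensor position,
    then repeatedly place the next aggregate node at the sensor
    farthest from all aggregates placed so far. This 2-approximates
    the minimal worst-case sensor-to-aggregate distance."""
    sites = [sensors[0]]
    while len(sites) < K:
        # the sensor whose nearest placed aggregate is farthest away
        far = max(sensors,
                  key=lambda s: min(math.dist(s, a) for a in sites))
        sites.append(far)
    return sites
```

    A lifetime-maximising placement would additionally weigh relay load and residual energy, but this shows the combinatorial shape of the problem.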


    Porting a SPARC-Solaris simulator to x86-Linux

    Supervisor: Dr Peter Strazdins

    Simulation is playing an increasingly important role in the design and evaluation of current and future high performance computer systems. Sparc-Sulima is an open-source UltraSPARC SMP simulator developed by the CC-NUMA Project. Sparc-Sulima simulates an UltraSPARC processor by interpreting each instruction, and has timing models which predict the performance of an UltraSPARC IIICu shared-memory processor to a high degree of accuracy.

    Currently, Sparc-Sulima can simulate an unmodified Solaris (UltraSPARC) binary (executable program). When the program executes a system call trap instruction, the corresponding operating system effects are `emulated' (the simulated computer state is updated directly, rather than via simulating a lengthy and complex sequence of kernel-level instructions), and simulated execution then resumes. This mechanism is handled by the SolEmn emulation layer in Sparc-Sulima. For example, threaded programs (e.g. using the POSIX pthreads library) ultimately access thread-related operating system services via the Solaris _lwp (lightweight process) trap instructions.

    A limitation of the current implementation of the SolEmn emulation mode is that it must be run on a SPARC-Solaris host, which greatly limits where the simulator can be used. Note, however, that simpler and more limited emulation modes have been ported to x86-Linux; these modes can only run specially (and statically) linked binaries - and this excludes threaded programs! The aim of this project is to `port' Sparc-Sulima to the most important platform - x86/Linux. The majority of the work will be in SolEmn: the emulation of Solaris system calls using Linux system calls. Since there is not always a one-to-one relation between them, this is not trivial. The strategy would be to gradually extend the number of ported calls and test with statically-linked programs of increasing functionality. Dynamically-linked programs will be more challenging, requiring the mmap system call among others.

    This project is related to the project Extended Threads Emulation in an SMP Computer Simulator. It is part of the CC-NUMA Project.

    References

    • Kevin Skadron et al., Challenges in Computer Architecture Evaluation, Computer, August 2003.
    • Any textbook on Unix systems programming; Solaris on-line documentation on system calls, including man -s 2 intro.
    • From the Sparc-Sulima home page: a survey paper on computer simulation (citing many more references) and papers on the implementation of Sparc-Sulima and SolEmn. There are manuals and source code too.


    Characterization of Application Performance of Cluster Computers

    Supervisor: Dr Peter Strazdins

    A Beowulf-style cluster computer is a parallel computer using a commercial-off-the-shelf switch-based network to communicate between the processors. The ANU Beowulf cluster Bunyip is such a cluster, based on Fast (100Mb) Ethernet switches. Clusters have proved a highly cost-effective high-performance computing model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks. However, the raw communication speed of COTS networks has not kept up with the dramatic increases in processor speed, and limits the performance of many applications on clusters.

    Applications written for clusters use the Message Passing Interface (MPI) both to synchronize and to communicate data between processes running on different cluster nodes. Costs of communication include both latency (the time to send a small message) and bandwidth (the rate per byte of data transfer, for larger messages).

    An important aspect of research into clusters is thus the characterization of application performance. At the lowest level: is performance limited by CPU, memory access, latency or bandwidth? For applications dominated by communication, finer questions can be addressed, such as whether the application can benefit from networks supporting bi-directional communication, and what its potential is for overlapping communication with computation. This information can be obtained from the literature (on the applications and benchmarks concerned), from study of the application's source code, and by experimentation. Of particular relevance is the paper on the characterization of the NAS parallel benchmarks listed below.

    This project will investigate this issue. It has a natural synergy with a sister project, which was undertaken by MIT student Tony Breeds in semester 1 2006, and will build upon the work in Tony's thesis. Instrumenting the MPI library and the use of performance counter libraries (which give access to hardware event counts) will be part of the new infrastructure developed by this project. Together, these projects will develop a performance evaluation methodology targeted at the Jabberwocky multicluster, which was installed in DCS in early 2006.

    References

    See the links above, and also:

    • The IEEE Task Force on Cluster Computing home page.
    • The HPC Challenge Benchmark, a cluster benchmark suite with individual benchmarks chosen to exhibit a variety of performance characteristics.
    • Architectural Requirements and Scalability of the NAS Parallel Benchmarks, Supercomputing '99.
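    The latency/bandwidth distinction is usually probed with a ping-pong micro-benchmark: time a round trip for a small message (dominated by latency) and for a large one (dominated by bandwidth). The sketch below illustrates the shape of such a measurement in Python over a local pipe between two processes; on a real cluster the same loop would run over MPI between nodes, and the absolute numbers here mean nothing for a network.

```python
import time
from multiprocessing import Pipe, Process

def _echo(conn, reps):
    # the partner process simply bounces every message back
    for _ in range(reps):
        conn.send_bytes(conn.recv_bytes())

def pingpong(size, reps=20):
    """Average round-trip time for `size`-byte messages over a local
    pipe -- the standard micro-benchmark shape for separating latency
    (small sizes) from bandwidth (large sizes)."""
    a, b = Pipe()
    p = Process(target=_echo, args=(b, reps))
    p.start()
    payload = bytes(size)
    t0 = time.perf_counter()
    for _ in range(reps):
        a.send_bytes(payload)
        a.recv_bytes()
    dt = (time.perf_counter() - t0) / reps
    p.join()
    return dt
```

    Plotting `pingpong(n)` against n and fitting t(n) = a + b*n recovers the latency and bandwidth parameters used in simple performance models.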


    Neural networks theory and applications

    Supervisor: Prof Tom Gedeon
    E-Mail: tom@cs.anu.edu.au

    Background

    A number of projects are possible in this area. I am interested in a number of topics, such as extracting rules from neural networks, information retrieval using neural networks, data mining and feature selection, cascade neural network structures, hierarchical neural network structures, and neural network applications. I have published papers in all of these areas with former students so there is plenty of earlier work to build on. No previous experience with neural networks is necessary. Most projects will use the very popular backpropagation neural network training algorithm. If you are interested in neural networks and would like to see if you might want to do a project in this area, please e-mail me tom@cs.anu.edu.au or come and see me.

    Example projects: i. Cascade neural network structures can be built automatically, by adding single neurons, without making decisions as to the number of neurons required to solve a problem. This project would investigate the use of larger chunks, such as feature maps, as cascade components, which would be useful for recognising images (including faces).

    ii. Rule extraction: Neural networks can learn complex tasks but face the problem that human beings do not trust them, as they cannot understand _why_ a particular decision is made. This project focuses on rule extraction for explanation.


    Fuzzy logic theory and applications

    Supervisor: Prof Tom Gedeon
    E-Mail: tom@cs.anu.edu.au

    Background

    A number of projects are possible in this area. I am interested in a number of topics, such as automated construction of fuzzy rule bases from data, hierarchical fuzzy systems, fuzzy interpolation, information retrieval using fuzzy logic, universal approximators, and fuzzy logic applications. I have published papers in all of these areas with former students so there is plenty of earlier work to build on. No previous experience with fuzzy logic is necessary. If you are interested in fuzzy logic and would like to see if you might want to do a project in this area, please e-mail me at tom@cs.anu.edu.au or come and see me.

    Example projects:
    i. Fuzzy document filtering: Web search engines return many documents which are not relevant to queries. Fuzzy techniques to enhance the results from search engines would be very useful. This project will investigate a number of fuzzy techniques for this purpose.

    ii. Combining uncertainty: Fuzzy logic is a technique which deals with uncertainty well. Sometimes data contains multiple kinds or sources of uncertainty. For example, a pedestrian could report that he saw someone he was quite sure was stealing something. He was not certain and his own reliability is another but different kind of uncertainty. This project is to develop some methods to combine different kinds of uncertainty in intelligence led investigation. (Data from a company specialising in this area will be available.)


    Bacterial evolutionary algorithms

    Supervisor: Prof Tom Gedeon
    E-Mail: tom@cs.anu.edu.au

    Background

    There are several optimisation methods inspired by natural processes. Evolutionary algorithms have been shown to be efficient tools for solving non-linear, multi-objective and constrained optimisation problems. The principle is a search over a population of solutions, where tuning is done using mechanisms similar to biological recombination. The operations of the bacterial evolutionary algorithm were inspired by microbial evolution: bacteria can transfer genes to other bacteria, so the rules are optimised in each bacterium.
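    To make the two core operations concrete, here is a minimal sketch of bacterial mutation and gene transfer on bit-string "bacteria" maximising a toy fitness function. The genome encoding, segment length and population size are illustrative choices for this sketch, not a prescribed design:

```python
import random

def bacterial_mutation(bacterium, fitness, segment_len=2, clones=3, rng=random):
    """Clone the bacterium, randomly rewrite one segment in each clone,
    and keep the best variant (the original is kept if no clone improves)."""
    best = bacterium[:]
    start = rng.randrange(0, len(bacterium) - segment_len + 1)
    for _ in range(clones):
        clone = best[:]
        for i in range(start, start + segment_len):
            clone[i] = rng.randint(0, 1)
        if fitness(clone) > fitness(best):
            best = clone
    return best

def gene_transfer(good, bad, segment_len=2, rng=random):
    """Copy a random segment from a fitter bacterium into a less fit one."""
    start = rng.randrange(0, len(good) - segment_len + 1)
    recipient = bad[:]
    recipient[start:start + segment_len] = good[start:start + segment_len]
    return recipient

# Toy run: maximise the number of 1-bits in an 8-bit genome.
fitness = sum
random.seed(0)
pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(4)]
for _ in range(30):
    pop = [bacterial_mutation(b, fitness) for b in pop]
    pop.sort(key=fitness, reverse=True)
    pop[-1] = gene_transfer(pop[0], pop[-1])  # best donates genes to worst
print(max(fitness(b) for b in pop))
```

    Note that mutation here never decreases a bacterium's fitness, while gene transfer spreads good segments through the population; these are the two mechanisms the project would adapt to clustering, image generation or scheduling.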

    Example projects:
    i. Clustering using BEAs: Clustering is more usually performed by gradient descent algorithms. BEAs should allow more interesting clusters to be formed where there are complex evaluation criteria for cluster validity. Objects clustered could be images (e.g. see 1.i or 4.ii), or text (e.g. see 2.i, 2.ii). Clusters could have predefined target criteria, such as hierarchical structure of a certain depth.

    ii. Making pictures: If bacteria encoded the components of an abstract computer-generated picture, an individual could repeatedly identify nice and not-nice images to generate some 'art' tuned to their aesthetic sense.

    iii. Scheduling: A previous student has worked on a GA for University exam scheduling. A comparison with bacterial algorithms would be an interesting project.


    Face Recognition

    Supervisor: Prof Tom Gedeon
    E-Mail: tom@cs.anu.edu.au

    Background

    A number of projects are possible in this area. I am interested in a number of topics, such as image processing for face recognition, HCI research tool to collect facial images, biologically plausible architectures for face recognition, building and modifying computer models of real faces. No previous work on face recognition is necessary. For the image processing topic, some experience with computer graphics would be useful. If you are interested in face recognition or some similar topic and would like to see if you might want to do a project in this area, please e-mail me tom@cs.anu.edu.au or come and see me.

    Example projects:
    i. View morphing: View morphing between two images of an object taken from two different viewpoints produces the illusion of physically moving a virtual camera. We generate compelling 2D transitions between images. Initially we will use the same face with small rotations and produce intermediate images to produce a realistic QuickTimeVR head.

    ii. Video annotation: Techniques such as PCA can recognise the same face so long as the rotation/expression/illumination is not too different. This can be used to recognise one person throughout a segment of video.

    iii. 3-D Face model: An average face 3D model can be constructed from a single face image and animated via rotation, a simple face expression model, or illumination. This can then be used to recognise faces in multiple poses, expressions or lighting.


    Data Cleaning Via Logic

    Supervisor: A/Prof Rajeev Gore
    E-Mail: rajeev.gore@anu.edu.au

    Background

    The Australian Bureau of Statistics stores its census information as a collection of tables with one row (record) for every individual with many columns like age, sex, marital status, current occupation etc. Such tables invariably contain errors, either introduced during data collection, or during data entry. For example, the census data may contain a record for a person named X who is aged 6, is married and in high school.

    Such data usually has an associated collection of extra conditions which capture important inter-relationships between the columns of the table. For example, the census table may have constraints like "the legal age of marriage is 16" and "a high-school student is aged between 11 and 18". According to these constraints, the entry for X is incorrect since the entry breaks both conditions, so the obvious question is to find a correction which is as close to the original as possible.
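    A minimal sketch of this kind of constraint checking (error-localisation in its simplest form) follows. The field names and edit rules below are invented for illustration; they are not the ABS's actual edits:

```python
# Each edit rule is a predicate that a correct record must satisfy.
RULES = {
    "marriage_age": lambda r: not (r["marital_status"] == "married"
                                   and r["age"] < 16),
    "high_school_age": lambda r: not (r["occupation"] == "high school student"
                                      and not 11 <= r["age"] <= 18),
}

def violated_rules(record):
    """Return the names of all edit rules the record fails."""
    return [name for name, rule in RULES.items() if not rule(record)]

# The record for person X from the example above.
x = {"age": 6, "marital_status": "married", "occupation": "high school student"}
print(violated_rules(x))  # both rules fail for this record
```

    The hard part, which the Fellegi-Holt method addresses, is not detecting violations but deciding which minimal set of fields to change so that every rule passes.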

    The problem of finding an error in a data table is known as error-localisation, and the problem of changing an incorrect record so that it passes all the constraints is known as error-correction. The standard method for solving both of these problems is called the Fellegi-Holt method, but it is highly inefficient. Recently, Agnes Boskovitz, a PhD student with the RSISE, has shown a deep connection between the Fellegi-Holt method and a well-known method from propositional logic called resolution.

    Efficient automated tools for performing resolution and associated logical deductions are readily available. The project is to implement the Fellegi-Holt method using these logical tools to gauge the efficacy of this logical approach to data cleaning. If successful, the prototype has huge potential in real world applications and will almost certainly lead to a publication in an international conference or journal.

    Feel free to email me if you are at all interested in this project.


    Interval Arithmetic for Computational Science

    Supervisor: Alistair Rendell
    E-Mail: Alistair.Rendell@anu.edu.au

    An interval is a continuum of real numbers bounded by its end-points. Interval arithmetic performs operations on intervals rather than on discrete floating point representations. As a consequence, errors associated with numerical rounding are automatically propagated and accounted for. Moreover, the mathematical properties of intervals can be exploited to develop entirely new algorithms.
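    A minimal sketch of the idea, using plain Python floats; a real implementation would apply directed (outward) rounding on every operation so that the resulting interval is guaranteed to contain the true result:

```python
class Interval:
    """Minimal interval type [lo, hi], illustrating the algebra only."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Sum of intervals: add lower bounds and upper bounds.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # Product: extremes among all four endpoint products,
        # which handles sign changes correctly.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

a = Interval(1.0, 2.0)
b = Interval(-3.0, 0.5)
print(a + b)  # [-2.0, 2.5]
print(a * b)  # [-6.0, 1.0]
```

    The width of the result interval is exactly the accumulated uncertainty, which is why interval evaluation of the recursive integral formulas below can expose how rounding error depends on the recursive path taken.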

    The concept of intervals is not new; it was considered by Moore in 1965. What is relatively new is that some hardware vendors now provide direct support for interval arithmetic in their compilers and hardware.

    In this project we wish to develop an interval code for evaluating the type of integrals that are used in quantum calculations on atoms and molecules. Evaluation of these integrals involves truncation of an infinite series, followed by the application of a variety of recursion relations. As a consequence, the values of the final integrals reflect both a series truncation error and a rounding error, the latter varying according to the recursive path taken. This project will study the balance between these two sources of error and its implications for large-scale quantum calculations.

    This project will build on work undertaken in 2005 by honours student Josh Milthorpe who studied the basic performance of interval operations and use of intervals in fast multipole calculations. You will also be part of a team working on interval related methods funded from an Australian Research Council Discovery grant.


    Automatic Differentiation

    Supervisor: Alistair Rendell
    E-Mail: Alistair.Rendell@anu.edu.au

    Every function, no matter how complicated, is executed on a computer as a series of elementary operations such as additions, multiplications, divisions and elementary functions like exp(). By repeated application of the chain rule to those elementary operations it is possible to obtain the derivative of any computed function in a completely mechanical fashion. This technique is applicable to computer programs of arbitrary length containing loops, branches and procedures. In contrast to explicit evaluation and coding of a derivative, automatic differentiation enables derivatives to be easily updated when the original code is changed.
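    The chain-rule mechanism can be sketched with forward-mode automatic differentiation via dual numbers, where every value carries its derivative alongside it. This is one standard AD technique, shown here for illustration; the tools mentioned below may use other modes (such as reverse mode or source transformation):

```python
import math

class Dual:
    """A (value, derivative) pair; each elementary operation applies
    the chain rule to the derivative component."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)

    def __mul__(self, o):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)

    def exp(self):
        # Chain rule through exp: (e^u)' = e^u * u'
        e = math.exp(self.val)
        return Dual(e, e * self.dot)

def derivative(f, x):
    """d/dx f at x, computed mechanically (no finite differences)."""
    return f(Dual(x, 1.0)).dot

# f(x) = x * exp(x), so f'(x) = (1 + x) * exp(x); at x = 1 this is 2e.
print(derivative(lambda x: x * x.exp(), 1.0))  # 5.43656...
```

    Because the derivative falls out of the same elementary operations the program already performs, changing the original function automatically changes its derivative, which is the key advantage noted above.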

    A number of programs are now available to perform automatic differentiation for code written in a variety of languages. The aim of this project would be to use one of these tools to perform derivative evaluations for some computational science applications. Specifically, in the modeling of systems at the atomic level, derivatives with respect to atomic positions are hugely important, determining things like the frequencies at which molecules vibrate. Colleagues and I have spent large quantities of time and effort deriving expressions for derivatives and then coding them. To automate this process, and for it to be relatively efficient, would be very significant. Your objective would be to explore this issue for some relatively simple computational molecular science codes.


    Numerical Simulation on Graphics Processing Units (GPU)

    Supervisor: Dr Alistair Rendell
    E-Mail: Alistair.Rendell@anu.edu.au

    Background

    Over the past 10 years graphics processing units (GPUs) have become ubiquitous in desktop computers. Modern GPUs now allow substantial user programmability that lets them be used for more general-purpose computation. Indeed, in a recent paper presented at SC04, Fan et al detail their efforts to construct a "GPU Cluster for High Performance Computing". In this work the authors plugged 32 GPUs into a Xeon cluster system and in so doing increased the total theoretical peak performance of the machine by 512 Gflops - for a cost of just under US$13,000!

    Tracking this development, a number of groups have been developing programming environments that can exploit the GPU for general computations. Details of many of these are available at the GPGPU web site. Two of particular interest to this project are Brook and Sh. The former is built on top of C while the latter is built on top of C++. Brook, for example, extends standard ANSI C with the concept of a "stream", a new data type representing data that can be operated on in parallel.

    The aim of this project is to investigate the potential for using the GPU in molecular science computations. The initial target application is likely to be N-body simulations, as typified by the simulation of atom motions. Careful consideration will be given to accuracy, since currently the GPU is limited to single precision computations. Thus the project will address two issues, i) the in principle feasibility of using the GPU for these types of problems, and ii) the scientific usefulness given the current single precision limitation.


    InfiniBand RDMA Remote Memory Swapping

    Supervisors: Richard Alexander, Alistair Rendell, Peter Strazdins
    E-Mail: Richard@AlexanderTechnology.com, Alistair.Rendell@anu.edu.au, Peter.Strazdins@cs.anu.edu.au

    Modern RDMA-capable networks such as InfiniBand provide low latency (a few microseconds) and high bandwidth (up to 10 Gb/s). This has significantly reduced the gap between access to local and remote memory in modern clusters. A remote paging system has been developed by DK Panda et al to exploit the RDMA capabilities of InfiniBand, using a High Performance Networking Block Device (HPBD) which serves as a swap device for Linux kernel virtual memory, enabling efficient page transfer to and from remote servers. This HPBD has been implemented for the Linux 2.4 kernel as a kernel module.

    This project will implement the RDMA HPBD as a kernel module for the Linux 2.6 kernel, and investigate ways to improve the performance of the RDMA HPBD to reduce copy cost and exploit the zero-copy feature of RDMA operations. Alexander Technology, as part of the Jabberwocky project, will provide hardware for this project.

    References:

    • Shuang Liang, Ranjit Noronha, DK Panda, Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device, Proceedings of the 2005 IEEE International Conference on Cluster Computing.
    • Jiuxing Liu, Jiesheng Wu, DK Panda, High Performance RDMA-Based MPI Implementation over InfiniBand. Proceedings of The 17th Annual International Conference on Supercomputing (ICS'03).
    • InfiniBand Trade Association home page


    MPI and Global File System I/O over InfiniBand using Virtual Lanes

    Supervisors: Richard Alexander, Alistair Rendell, Peter Strazdins
    E-Mail: Richard@AlexanderTechnology.com, Alistair.Rendell@anu.edu.au, Peter.Strazdins@cs.anu.edu.au

    InfiniBand networks provide for multiple, parallel system area networks using the same physical hardware, via a concept called Virtual Lanes. Virtual Lanes allow the network to be split for a variety of purposes, including the separation of HPC cluster management, message passing and I/O traffic. A number of parallel filesystems are used in heterogeneous Linux clusters, the most widely used being PVFS2, GPFS and Lustre, in addition to the legacy NFS system. One possibility with large SMP nodes in a cluster is to use InfiniBand virtual lanes and operating system virtual machines in a way that allows nodes of a cluster to be used as both compute nodes and filesystem storage nodes. This project will implement the Lustre filesystem on a small test cluster with large SMP nodes and the InfiniBand interconnect, using virtual machines in a novel way to segregate compute functions from I/O.

    References:

    • J Liu, A Vishnu, DK Panda, Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation. Proceedings Supercomputing 2004.
    • J Cope, M Oberg, H Tufo and M Woitaszek, Linux Multi-Cluster Environments.

    Declarative theorem-proving

    Supervisor: Dr Michael Norrish
    Email: michael.norrish@nicta.com.au

    This project would suit a student keen to work in an area involving logic and proof, and the development of infrastructure to support these activities.

    When specifying systems formally, a significant error-finding benefit can accrue simply from writing everything down in an expressive logic, and getting it all to type-check. Working in this way allows users to quickly deal with the “low-hanging fruit”: important bugs that are easy to find.

    The HOL system currently doesn’t support this model of work very well. Part of the reason for this is that script files for HOL intermingle SML (which is the theorem-prover’s implementation language) and embedded bits of logic. We would like to check the logic without having to be able to parse (and execute) SML as well.

    Therefore, the first aim of this project is to design a simple file format in which users could write “SML-less HOL”, and to then implement at least two tools for manipulating this format. The first tool would be the type-checker. It would make sure that the definitions and other logical material in the file actually made sense at the level of types. The second tool would do all this and also produce a genuine HOL theory, using the various definitional facilities already present in HOL. This file format would need to include mechanisms for referencing other files, where previous logical developments had been built up.

    The final stage of the project would be to implement a simple declarative language for writing proofs. The issues here are similar to those for definitions; to escape the use of files that include (SML) code, and move to formats that are easy to parse and manipulate as data.


    Finding Models of Unsolvable First Order Goals

    Supervisors: Dr Michael Norrish
    Email: michael.norrish@nicta.com.au

    When using an interactive proof assistant, a powerful tactic for finishing goals is the use of first order proof procedures. Unfortunately, these procedures are all-or-nothing affairs: if they succeed, they succeed completely, but if they fail to prove the goal, the user knows only that the provided lemmas and assumptions were not sufficient to allow the procedure to find a proof.

    The advent of model generation tools promises a way out of this painful situation. These tools can be used to generate what is effectively a counter-example to false goals. This is much more helpful for the user. Either the tool's output really is a counter-example and their proof will never succeed, or the counter-example will be impossible because the user has forgotten to tell the tools some relevant lemma from a database of known facts.

    The student's task in this project would be to implement a connection between two existing tools: the interactive system HOL4, and the model generation tool Darwin. This connection would provide a user-friendly implementation of the scheme described above: when an invocation of the first order prover in HOL failed, the connection implementation would translate the HOL goal into terms that Darwin could understand, call Darwin, and then finally translate Darwin's provided counter-example back into a human-readable form for the user's benefit.

    This project would suit a student familiar with functional programming (HOL is written in SML), and happy with the concepts of first order logic (terms, formulas, quantifiers). The student would be jointly supervised by Michael Norrish and Peter Baumgartner.


    Optimizing for Design Patterns

    Supervisor: Dr Steve Blackburn and Dr Alonso Marquez
    Email: Steve.Blackburn@anu.edu.au and alonsomarquezes@yahoo.es

    Background

    The software engineering benefits of design patterns are widely understood. Consequently their use has become ubiquitous in industry. This widespread uptake is in spite of many patterns incurring significant performance penalties. This phenomenon is only just emerging and so far has not received attention from the programming languages community.

    Previously we have shown that bytecode transformations at class loading time can be used to make significant performance optimizations for Java programs. Our approach is powerful and unorthodox, sidestepping heavyweight pointer analyses typically used by the mainstream compiler community. This project will leverage this approach to optimize for some heavily used design patterns.

    We anticipate that this work could lead to publication.


    Performance profiling for Java

    Supervisor: Dr Steve Blackburn and Daniel Frampton
    Email: Steve.Blackburn@anu.edu.au and Daniel.Frampton@anu.edu.au

    Background

    A popular profiling technique is to use sampling. Periodically the program counter is sampled, and over the run of a program (involving many thousands of samples), a histogram is produced. The program counter values are mapped back to source lines of code to identify which source code is most heavily executed. This is often very helpful in debugging performance problems.

    Refinements to this idea include walking the stack at each sample point, to enrich the context. Another refinement is to use hardware performance counters. Hardware performance counters can be set to trigger interrupts after a given number of events (such as L2 cache misses, for example). This can be used to more specifically narrow down performance problems by identifying what code is triggering particular performance problems.

    Our group is leading work in Java performance and optimization and we are responsible for key research infrastructure to this end. One crucial tool we lack is the capacity to do performance counter driven profiling. This project would build a performance counter based profiler for Jikes RVM (the widely used research virtual machine), and would analyze the performance of some key Java programs which are currently not well understood.
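    The core of sample-based profiling is just a histogram of program-counter samples mapped back to source locations. A hedged sketch follows; the sample stream and symbol table are fabricated for illustration, whereas a real profiler would collect samples from a timer or performance-counter interrupt:

```python
from collections import Counter

# Illustrative symbol table: (start_address, end_address, source_location).
SYMBOLS = [
    (0x1000, 0x1fff, "matmul.c:42"),
    (0x2000, 0x2fff, "io.c:17"),
]

def locate(pc):
    """Map a sampled program-counter value back to a source location."""
    for start, end, where in SYMBOLS:
        if start <= pc <= end:
            return where
    return "<unknown>"

def profile(samples):
    """Fold raw PC samples into a per-source-line histogram."""
    return Counter(locate(pc) for pc in samples)

# Five fabricated samples; most fall in the matmul routine.
samples = [0x1004, 0x1abc, 0x2004, 0x1010, 0x1f00]
print(profile(samples).most_common())
```

    Driving the same histogram from hardware-event interrupts (e.g. one sample per N L2 cache misses) instead of a timer is what turns this into the performance-counter profiler the project describes.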


    Sane Garbage

    Supervisor: Dr Steve Blackburn and Daniel Frampton
    Email: Steve.Blackburn@anu.edu.au and Daniel.Frampton@anu.edu.au

    Background

    Garbage collection is key to modern languages like Java and C#. Our group plays a leading role in the development of garbage collection algorithms and we maintain the leading infrastructure for memory management research, MMTk. Garbage collectors are notoriously difficult to debug, as they subvert most assumptions about program behavior (by manipulating the object graph behind the back of the application). There are very few useful tools for debugging garbage collectors---typically we must resort to primitive methods such as the careful use of print statements!

    This project will develop a heap sanity checker. The heap sanity checker will be completely unobtrusive, only observing, not altering the state of the system. The sanity checker will always be present in the system, though inactive unless invoked at the command line. The checker would report on dangling pointers, uncollected garbage, etc. The tool would also be extensible, allowing the addition of algorithm-specific checks (such as reference count checks for a reference counting collector). This tool would be invaluable to garbage collection research.
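    The essential check is a reachability analysis: mark everything reachable from the roots, then compare against the allocator's record of live objects. A toy sketch, assuming a heap represented as an adjacency map (this is illustrative, not MMTk's actual object model):

```python
def sanity_check(heap, roots):
    """heap: {object_id: [referenced_ids]}; roots: iterable of ids.
    Returns (uncollected_garbage, dangling_pointers)."""
    allocated = set(heap)
    marked, dangling = set(), set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        if obj not in allocated:
            # A reachable reference to an object the allocator never
            # issued (or already freed): a dangling pointer.
            dangling.add(obj)
            continue
        marked.add(obj)
        stack.extend(heap[obj])
    # Allocated but unreachable objects are uncollected garbage.
    return allocated - marked, dangling

heap = {"a": ["b"], "b": ["c"], "c": [], "orphan": []}
garbage, dangling = sanity_check(heap, roots=["a", "b_freed"])
print(garbage)   # {'orphan'}
print(dangling)  # {'b_freed'}
```

    Algorithm-specific extensions slot in naturally: for a reference counting collector, for example, the same traversal can recompute true reference counts and compare them with the stored ones.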


    An allocator for machine code

    Supervisor: Dr Steve Blackburn and Daniel Frampton
    Email: Steve.Blackburn@anu.edu.au and Daniel.Frampton@anu.edu.au

    Background

    Modern virtual machines use Just In Time (JIT) compilers to dynamically generate machine code. The placement of the code in memory can have significant performance impacts. For example, if two contemporaneously hot methods are located in memory in such a way as to conflict within the instruction cache, then performance will suffer. Furthermore, the needs of code allocation are somewhat different to conventional allocation, because code objects tend to be larger than normal objects and their size distribution is much less homogeneous. Finally, code allocation is rare, so a slightly more expensive allocation mechanism that is more careful with space may be far preferable.

    Currently in Jikes RVM, code is allocated using a conventional allocator. This project would develop a new allocator that would allow the user a higher degree of flexibility in the placement of code (possibly specifying cache lines to avoid, or cache lines to favor), and would be better tailored to the space and time requirements of code allocation.
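    The conflict-avoidance idea can be sketched as follows; the cache geometry, candidate addresses and first-fit placement policy are illustrative assumptions, not Jikes RVM's actual mechanism:

```python
CACHE_SETS = 64
LINE_SIZE = 64  # bytes; set index = (addr // LINE_SIZE) % CACHE_SETS

def cache_sets(addr, size):
    """The instruction-cache set indices a code block at `addr` occupies."""
    first = addr // LINE_SIZE
    last = (addr + size - 1) // LINE_SIZE
    return {line % CACHE_SETS for line in range(first, last + 1)}

def place_avoiding(size, avoid_sets, candidates):
    """Pick the first candidate address whose cache sets miss `avoid_sets`,
    so a newly compiled hot method does not conflict with an existing one."""
    for addr in candidates:
        if not (cache_sets(addr, size) & avoid_sets):
            return addr
    return None  # no conflict-free slot; caller falls back to any placement

hot = cache_sets(0x0000, 256)                        # sets of an existing hot method
addr = place_avoiding(256, hot, [0x0000, 0x0100, 0x0200])
print(hex(addr))  # 0x100
```

    Because code allocation is rare, spending this extra per-allocation search is cheap relative to the instruction-cache misses it can avoid.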


    Java to C++ Translation

    Supervisor: Dr Steve Blackburn and Robin Garner
    Email: Steve.Blackburn@anu.edu.au and Robin.Garner@crsrehab.gov.au

    Background

    MMTk is a memory management toolkit that is an important tool for memory management research. MMTk is written in a special dialect of Java which avoids dynamic memory allocation and the use of Java libraries that might perform such allocation, and which adds to Java unboxed types for memory primitives such as Address, Word, and ObjectReference. The result is a powerful, flexible and high performing toolkit which allows researchers to rapidly experiment with garbage collection ideas.

    MMTk was implemented within Jikes RVM, a Java virtual machine written in Java. Since then MMTk has been ported to another virtual machine written in C. Rather than port the source code from Java to C, it is desirable to somehow produce a C library usable by the C program. In the first instance, this was done using the GNU compiler for Java, gcj. We would like a more general solution.

    This project will write a Java to C++ translator for MMTk. The limited subset of Java used by MMTk makes the project feasible. We envisage the automatic translator generating both C++ files and headers with inline functions in them, allowing key elements of the interface to be inlined into the host virtual machine, greatly improving performance.


    The Active National University: applying location awareness

    Supervisor: A/Prof Chris Johnson
    E-Mail:
    cwj@cs.anu.edu.au

    Background

    The proliferation of wireless access points and mobile access devices across university campuses provides an opportunity to use information on people's locations, provided by their wireless computing devices, to support their working and recreational life on campus. The Active Campus project at the University of California San Diego is an example of a prototype system. In any such system there are technical issues about the effectiveness of the wireless network as a location sensor and information carrier, issues of who provides and administers access, identification of devices and people, the design of applications and their interfaces on mobile devices, privacy and ethics in tracking and informing on people, and the effectiveness of systems as people use them and adapt to their use. Wireless location on this scale has low precision and is highly variable. The ANU is deploying a network of 802.11b (WiFi) wireless access points; it's time to turn the campus into the Active National University.

    Software Engineering honours student research project

    Several projects in this area will potentially support each other:

    1. Support framework for applications (software engineering biased): Analyse, design and implement a robust, extendable framework that demonstrates support for a small suite of basic applications suited to the ANU campus. Mobile computing platforms can be those anticipated in the near future, and the applications can be selected or invented to support education, research, and social activities. The system must incorporate a strong, flexible emphasis on privacy and security.

    2. Location Authority and Service: Various structures and requirements for Location Authorities for location-based services have been suggested, in particular by Shafer. A robust, long-lived, easily maintained Location Authority is essentially a database that maps the device identity of wireless access points onto their geographic locations on campus, but it also provides procedures and utilities for creating and maintaining the actual data as access points are installed, moved, go down and return to service. The LA should interoperate with Geographic Information Systems, such as those giving the coordinate systems provided by available maps of the ANU campus, as a demonstration example.

    3. Novel applications: Design, implement and evaluate the effectiveness and social issues of novel location-aware applications on the Active National University framework system. Human and social factors are as important here as communications and information services.

    More information

    1. Guiding, Tracking and perceptual computing: http://www.designing-ubicomp.com/Arbeitsdateien/7_tracking+guiding.html
    2. Ubiquitous Computing (in Daily Wireless): http://www.dailywireless.org/modules.php?name=News&file=article&sid=1631
    3. ActiveCampus: Experiments in Community-Oriented Ubiquitous Computing, William G. Griswold, Patricia Shanahan, Steven W. Brown, Robert Boyer, Matt Ratto, R. Benjamin Shapiro, Tan Minh Truong, IEEE Computer, October 2004 pp. 73-81
    4. The Carrot Approach: Encouraging Use of Location Systems, by Kieran Mansley, Alastair R. Beresford, David Scott. in N. Davies et al .(eds) UbiComp 2004, LNCS 3205, pp 366-383, 2004.
    5. Everyday Encounters with Context-Aware Computing in a Campus Environment Louise Barkhuus and Paul Dourish in N. Davies et al .(eds) UbiComp 2004, LNCS 3205, pp 232-249, 2004
    6. Steven A. N. Shafer, Location Authorities for Ubiquitous Computing, in Mike Hazas, James Scott, and John Krumm (eds), UbiComp workshop on Location-Aware Computing, 2003. at http://www.ubicomp.org/ubicomp2003/workshops/locationaware

  • Updated: 18 February 2011