Honours topics
Half of your time as an honours student is spent working on a project. But first you have to find a project topic.
The "official" reference for projects proposed by potential supervisors is the CECS projects database.
Projects are available there for all levels of research, including short projects, summer scholarship projects, Honours projects, Masters and PhD projects. All potential research students at any level are urged to browse it.
If you see a project that looks interesting, email the potential supervisor about it. Don't be afraid to discuss possible variations to the listed project: what appears on the web site is generally more of a suggestion than a rigid specification.
You don't have to be constrained by what you see on the project website. If you have something that you would like to work on as a project, then feel free to discuss it with the honours convener to see if it could form the basis of an honours project and to identify a possible supervisor. Look at the web pages of Computer Science staff to find out their research interests. Remember that projects may also be supervised by people outside the College, or even outside the University: from CSIRO or NICTA, for instance.
Former Project Topics
For your interest, here is an archive of Honours project proposals from previous years. Some of them are quite ancient now, of course, but they may help to give you ideas of the kind of thing which is suitable.
2006 Honours project proposals
Apply Dynamical Bayesian Network to Query Digital Forensics
Supervisors:
Professor Terry Caelli, VISTA Program, NICTA and RSISE, ANU
Dr Nianjun Liu, VISTA Program, NICTA and RSISE, ANU
Email:
Terry.Caelli@anu.edu.au
and Nianjun.Liu@nicta.com.au
Digital forensics undertakes the post-mortem reconstruction of the causal sequence of events arising from an intrusion perpetrated by one or more external agents, or as a result of unauthorised activities generated by authorised users, in one or more digital systems. The field of digital forensics covers a broad set of applications, uses a variety of evidence and is supported by a number of techniques. Application areas include forensic accounting, law enforcement, commodity flow analysis and threat analysis. Forensic investigations often focus on unusual and interesting events that may not have arisen previously. A major objective of a digital investigation is to extract these interesting pieces of evidence and to identify the causal relationships between them.
This project aims at extending an existing Dynamical Bayesian Network model developed for digital forensics by investigating a number of possible topics. The model developed uses a Bayesian network and hidden Markov model network structure to (i) estimate typical digital crime scenario models from data and (ii) given such models, infer the most likely criminal act given current observations and past criminal acts.
First, the addition of multiple forward linkages in the BN+HMM network structure, to allow direct linkages between BN nodes at consecutive time intervals. Second, the design of an SQL-based query tool to explore the activities of criminals and their interactions, to explain what happened in the past, and to predict what will happen in the near future. Finally, the application of a graphical model to data mining of relational digital forensic databases, including construction of a relational pattern structural database for known types of digital crime portfolios and their associated forensic Bayesian Network models.
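As a rough, purely illustrative sketch of the kind of temporal inference involved, the following Python snippet runs the Viterbi algorithm over a toy two-state hidden Markov model of attacker activity. The state names, observation symbols and probabilities are invented for illustration and are not part of the project's actual BN+HMM model.

# Toy sketch only: a two-state HMM over invented "attacker activity" states.
# All names and probabilities below are illustrative assumptions, not part of
# the actual forensics model described in the project.

states = ["recon", "intrusion"]

start_p = {"recon": 0.7, "intrusion": 0.3}
trans_p = {"recon":     {"recon": 0.6, "intrusion": 0.4},
           "intrusion": {"recon": 0.2, "intrusion": 0.8}}
emit_p  = {"recon":     {"port_scan": 0.6, "login_fail": 0.3, "file_access": 0.1},
           "intrusion": {"port_scan": 0.1, "login_fail": 0.4, "file_access": 0.5}}

def viterbi(obs_seq):
    """Return the most likely state sequence for a sequence of observations."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs_seq[0]] for s in states}]
    path = {s: [s] for s in states}
    for obs in obs_seq[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][obs], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best_state = max(V[-1], key=V[-1].get)
    return path[best_state], V[-1][best_state]

if __name__ == "__main__":
    seq = ["port_scan", "login_fail", "file_access"]
    print(viterbi(seq))   # most likely hidden activity sequence and its probability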
Applicants must have a major in information technology, computer science, or electrical engineering, preferably with a strong mathematics background and an understanding of probability and statistics, experience in data analysis and data mining, pattern recognition and machine learning, and excellent programming abilities (MATLAB, C/C++ and Java).
Contact Dr Nianjun Liu (nianjun.liu@nicta.com.au)
Intelligent Environmental Query on Spatial Data
Supervisors:
Dr Nianjun Liu, VISTA Program, NICTA and RSISE, ANU
Professor Terry Caelli, VISTA Program, NICTA and RSISE, ANU
Email:
Nianjun.Liu@nicta.com.au
and
Terry.Caelli@anu.edu.au
Analysis of spatial information is crucial to support decision making in natural resources management. However, with the advent of various technologies for acquiring the data, analysis of multiple spatial data sets has become very challenging. These technologies produce data at different accuracies and resolutions. Beyond this multiple representation of spatial data, evidence about an area can come from different times and different observers' views, which makes combining that evidence quite complicated. Combining spatial data or evidence is not simply a matter of combining evidence from different technologies; it also means combining multi-criteria evidence.
The Australian Bureau of Rural Sciences (BRS) has developed a system known as the multi-criteria analysis shell for spatial decision support (MCAS-S). The project is to incorporate an option to use (Dynamical) Bayesian network approaches to model multiple types of evidence for intelligent environmental querying and decision support.
The project involves various techniques: image processing to preprocess the spatial GIS data, machine learning, pattern recognition and probability theory. Applicants must have a major in information technology, computer science, or electrical engineering, preferably with a strong mathematics background and an understanding of probability and statistics, experience in data analysis and data mining, pattern recognition and machine learning, and excellent programming abilities (MATLAB, C/C++ and Java).
Contact Dr Nianjun Liu (nianjun.liu@nicta.com.au)
Intelligent Land Planning on Relational Spatial Data
Supervisors:
Dr Nianjun Liu, VISTA Program, NICTA and RSISE, ANU
Professor Terry Caelli, VISTA Program, NICTA and RSISE, ANU
Email: Nianjun.Liu@nicta.com.au
and Terry.Caelli@anu.edu.au
Multi-criteria approaches to the analysis of complex issues in environmental decision systems have found wide application across business, government and communities around the world. Such approaches may be readily applied in the context of land planning, which is a prerequisite to the development of a city, town or suburb. Generally, planners collect a range of information about an area, including information about natural resources, topography, demographics, political issues, economic characteristics and proximity to neighbouring settlements and services, and combine this information to make planning decisions. Computer-aided multi-criteria decision support tools allow for measurement and analysis of alternatives or options, involving a variety of both qualitative and quantitative dimensions.
The project involves collaboration with the ACT Planning and Land Authority (ACTPLA) to present a sample demonstration of an Intelligent Land Planning tool using Bayesian Networks. Specifically, it will aim to develop a tool for the selection of optimal sites for community services within existing areas of settlement in the ACT, including recreational facilities, schools, childcare centres, aged care facilities and community centres. The factors of interest include existing settlement patterns, demographics (current and anticipated), future development, available land resources, existing services and community need.
Bayesian Networks are designed for situations where there is uncertainty about evidence and about how it should be combined in decision making. The proposed approach is a supervised strategy whereby experts provide known thematic layers of the land cover GIS spatial database as known successful decisions. The Bayesian Networks are then trained to optimize the prediction of such decisions, with the aim of applying the optimal decision model to new situations or scenarios.
After exploring the ACT spatial database and other data sources, the scholar will identify the relevant decision factors with the aid of ACTPLA experts and will then build the Bayesian Network model. Through iterative data mining on the relational database, the model will be continuously refined and its structure and parameters adjusted accordingly. Finally, the scholar will create the new tool and test it in a real-world context within ACTPLA.
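To give a flavour of supervised probabilistic scoring of candidate sites, here is a minimal sketch using a naive Bayes classifier (a very simple special case of a Bayesian network). The thematic layers, attribute values and training records are invented for illustration and bear no relation to ACTPLA's data or to the model the project would actually build.

# Minimal sketch, not ACTPLA's actual model: a naive Bayes "suitability"
# classifier over discrete thematic-layer attributes.  The layer names,
# values and training records below are invented for illustration.
from collections import Counter, defaultdict

# Each record: (attributes, expert decision) -- hypothetical training data.
training = [
    ({"slope": "flat",  "transport": "near", "zoning": "residential"}, "suitable"),
    ({"slope": "flat",  "transport": "far",  "zoning": "residential"}, "suitable"),
    ({"slope": "steep", "transport": "far",  "zoning": "rural"},       "unsuitable"),
    ({"slope": "steep", "transport": "near", "zoning": "rural"},       "unsuitable"),
]

prior = Counter(label for _, label in training)
cond = defaultdict(Counter)          # (label, attribute) -> Counter of values
for attrs, label in training:
    for attr, value in attrs.items():
        cond[(label, attr)][value] += 1

def posterior(site):
    """Unnormalised P(label | site attributes), with crude smoothing to avoid zeros."""
    scores = {}
    for label, count in prior.items():
        p = count / len(training)
        for attr, value in site.items():
            c = cond[(label, attr)]
            p *= (c[value] + 1) / (sum(c.values()) + len(c) + 1)
        scores[label] = p
    return scores

if __name__ == "__main__":
    print(posterior({"slope": "flat", "transport": "near", "zoning": "rural"}))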
The project involves various techniques: spatial relational databases, Structured Query Language (SQL), machine learning, pattern recognition and probability theory. Applicants must have a major in information technology, computer science, or electrical engineering, preferably with a strong mathematics background and an understanding of probability and statistics, experience in spatial databases and data mining, pattern recognition and machine learning, and excellent programming abilities (MATLAB, C/C++ and Java).
Contact Dr Nianjun Liu (nianjun.liu@nicta.com.au)
Visualization of Sensor Network Localization Process
Supervisors:
Dr Baris Fidan, SEaCS Program, NICTA and RSISE, ANU
Mr Brad Yu, SEaCS Program, NICTA and RSISE, ANU
E-Mail:
Baris.Fidan@anu.edu.au
and Brad.Yu@anu.edu.au
Can one determine sensor locations without using expensive global positioning systems (GPS)? Yes: fundamental theoretical advances have enabled us to do that with a small number of anchor sensors whose location information is available, and without knowing the exact locations of the other sensors. The applications are of great interest to both civil and defence authorities. The sensor network localization problem is one of determining the three-dimensional positions of all sensors in a network given the exact positions of some (anchor sensors) and a number of inter-sensor distance measurements. This project is to develop graphical tools for visualizing the localization process, possibly starting from many different parts of the network (i.e. seeds for localization) and running them in parallel. The outcomes of the Honours project will be used in demonstration and testing tasks of a broader NICTA research project. The Honours student should be familiar with MATLAB, capable of programming in C/C++ or Java, and able to produce GUIs and/or graphical tools. The student will receive close and frequent supervision from both Dr Fidan and Mr Yu, and will find opportunities to interact with the Chief Scientist of NICTA, Prof. Brian Anderson, and our overseas collaborators in the US and Belgium. He/she will also have the chance to work closely with other ANU/NICTA researchers and PhD students.
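As a toy illustration of the underlying localization problem (not the visualization tools the project asks for), the following sketch estimates a single sensor's position from noisy ranges to known anchors by linearising the range equations and solving a least-squares system. All coordinates and measurements are made up.

# Hedged sketch: estimate one sensor's position from noisy distance
# measurements to known anchors.  Anchor coordinates and distances are
# made-up demo values, not data from the NICTA project.
import numpy as np

anchors = np.array([[0.0, 0.0, 0.0],
                    [10.0, 0.0, 0.0],
                    [0.0, 10.0, 0.0],
                    [0.0, 0.0, 10.0]])
true_pos = np.array([3.0, 4.0, 5.0])
dists = np.linalg.norm(anchors - true_pos, axis=1) + np.random.normal(0, 0.05, 4)

# Subtracting the first range equation from the others removes the quadratic
# terms, leaving a linear system A x = b in the unknown position x.
a0, d0 = anchors[0], dists[0]
A = 2 * (anchors[1:] - a0)
b = (d0**2 - dists[1:]**2
     + np.sum(anchors[1:]**2, axis=1) - np.sum(a0**2))
estimate, *_ = np.linalg.lstsq(A, b, rcond=None)
print("estimated position:", estimate)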
Modelling and Control of Autonomous Multiagent Systems
Supervisors:
Dr Baris Fidan, SEaCS Program, NICTA and RSISE, ANU
Mr Brad Yu, SEaCS Program, NICTA and RSISE, ANU
E-Mail:
Baris.Fidan@anu.edu.au
and Brad.Yu@anu.edu.au
How do swarms of agents, like fish or unmanned autonomous vehicles, manage to move in a formation, split a formation, merge formations, follow a leader, change formation and so on without colliding, and without a master device to drive them? Problems like this are the fascinating province of cooperative control, with various defence and civil applications. This project aims to develop efficient models of various autonomous multiagent systems as well as to design and test system architectures and controllers that coordinate the agents in these systems for various missions. A typical mission may involve forming a certain team structure (formation), maintaining or changing such a formation, intelligent guidance for cohesive motion of the whole team, etc. The outcomes of the Honours project will be used in demonstration and testing tasks of a broader NICTA research project. Visualization and testing are important in this Honours project, and development of simple graphical tools may be needed. Interested students should be capable of programming in C/C++, Java or MATLAB, and able to produce GUIs and/or graphical tools. The student will receive close and frequent supervision from both Dr Fidan and Mr Yu, and will find opportunities to interact with the Chief Scientist of NICTA, Prof. Brian Anderson, and our overseas collaborators in the US and Belgium. He/she will also have the chance to work closely with other ANU/NICTA researchers and PhD students.
Optimal Control and Path Planning for Multi-Vehicle Systems
Supervisors:
Dr Baris Fidan, SEaCS Program, NICTA and RSISE, ANU
Mr Brad Yu, SEaCS Program, NICTA and RSISE, ANU
E-Mail:
Baris.Fidan@anu.edu.au
and Brad.Yu@anu.edu.au
Inspired by biological swarms of bees, birds, ants, fish, etc., many researchers today study efficient coordination of teams of aerial, ground and underwater vehicles for various cooperative missions. A typical mission may involve forming a certain team structure (formation), maintaining or changing such a formation, intelligent guidance for cohesive motion of the whole team, etc., without a master device to drive the vehicles in the team. Efficient coordination of vehicles in such missions has various defence and civil applications. This project aims to design optimal control and path planning schemes for various tasks such as formation maintenance during manoeuvres, and the change, merging and splitting of groups of autonomous vehicles. The outcomes of the Honours project will be used in demonstration and testing tasks of a broader NICTA research project. Interested students should be capable of programming in C/C++, Java or MATLAB, and have good knowledge of basic control systems and mathematics. The student will receive close and frequent supervision from both Dr Fidan and Mr Yu, and will find opportunities to interact with the Chief Scientist of NICTA, Prof. Brian Anderson, and our overseas collaborators in the US and Belgium. He/she will also have the chance to work closely with other ANU/NICTA researchers and PhD students.
Instance Based Methods
Supervisors: Dr Peter Baumgartner, Prof John Slaney E-Mail: Peter.Baumgartner AT nicta.com.au
Formal logic provides a mathematical foundation for many areas of computer science. Logical languages are used as specification language within, e.g., program development and verification, hardware design and verification, relational databases, and many subfields of Artificial Intelligence. Automated Deduction is concerned with the design and implementation of algorithms based on logical deduction for solving problems in these areas. Instance Based Methods are a relatively new family of methods for Automated Deduction; they share the principle of carrying out proof search by maintaining a set of instances of input formulas and analyzing it for satisfiability until completion.
I offer several research-oriented honours projects centered around Instance Based Methods, in particular in connection (experimentation/extension) with our own Darwin system (http://goedel.cs.uiowa.edu/Darwin/).
A circuit-based algorithm for solving quantified Boolean formulae
Supervisor: Dr Jussi Rintanen E-Mail: rintanen-AT-informatik.uni-freiburg.de
Dr Rintanen will join NICTA's Knowledge Representation and Reasoning Program (KRR) in January 2006. He is currently with the University of Freiburg, Germany. The local contact for this project before Dr Rintanen's arrival is Dr Sylvie Thiébaux.
Background
Following the success of satisfiability algorithms in solving a wide range of practically important problems in computer aided verification, artificial intelligence and other areas, the idea of using algorithms for quantified Boolean formulae (QBF) for similar but more general problems has become increasingly attractive. One of the problems present in the current QBF algorithms is that they are defined only for QBF that have their matrices in conjunctive normal form (CNF), and the most natural representations of many problems in QBF involve arbitrary Boolean formulae or circuits and do not have simple translations to CNF.
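To make the CNF restriction concrete, here is a small, purely illustrative sketch that represents a QBF matrix as an arbitrary Boolean formula tree and evaluates the quantifier prefix by naive recursive expansion. This is exponential in the number of variables and is in no way the proof system or algorithm the project seeks; it only shows that a non-CNF matrix can be handled directly.

# Illustrative sketch only: a naive recursive evaluator for QBF whose matrix
# is an arbitrary Boolean formula (not CNF).  Exponential in the number of
# variables; real solvers use far more refined proof systems.

# Formulas as nested tuples: ('var', name), ('not', f), ('and', f, g), ('or', f, g)
def evaluate(formula, assignment):
    op = formula[0]
    if op == 'var':
        return assignment[formula[1]]
    if op == 'not':
        return not evaluate(formula[1], assignment)
    if op == 'and':
        return evaluate(formula[1], assignment) and evaluate(formula[2], assignment)
    if op == 'or':
        return evaluate(formula[1], assignment) or evaluate(formula[2], assignment)
    raise ValueError(op)

def solve_qbf(prefix, matrix, assignment=None):
    """prefix: list of ('forall'|'exists', var); matrix: a formula tree."""
    assignment = dict(assignment or {})
    if not prefix:
        return evaluate(matrix, assignment)
    quant, var = prefix[0]
    branches = (solve_qbf(prefix[1:], matrix, {**assignment, var: val})
                for val in (False, True))
    return all(branches) if quant == 'forall' else any(branches)

if __name__ == "__main__":
    # forall x exists y. (x <-> y), written without any CNF conversion:
    matrix = ('or', ('and', ('var', 'x'), ('var', 'y')),
                    ('and', ('not', ('var', 'x')), ('not', ('var', 'y'))))
    print(solve_qbf([('forall', 'x'), ('exists', 'y')], matrix))  # True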
Honours project
The goal of the research project is to develop a proof system for QBF that are represented as arbitrary Boolean circuits, not restricted to CNF, and to develop an efficient QBF algorithm based on the proof system. In comparison to circuit-based proof systems for the classical propositional logic, this involves identifying proof rules for quantifiers and universally and existentially quantified variables, as well as rules for efficient scoping of variables. The QBF algorithm will be compared to CNF-based QBF algorithms which are used in connection with translations of circuit-QBF into QBF in CNF.
Compact representation of schematic operators in planning as satisfiability
Supervisor: Dr Jussi Rintanen E-Mail: rintanen-AT-informatik.uni-freiburg.de
Dr Rintanen will join NICTA's Knowledge Representation and Reasoning Program (KRR) in January 2006. He is currently with the University of Freiburg, Germany. The local contact for this project before Dr Rintanen's arrival is Dr Sylvie Thiébaux.
Background
Translation into the classical propositional logic has been proposed as a powerful and general solution technique for classical AI planning. The actions of many planning problems are best represented in terms of schemata that are instantiated with given objects to obtain the actual actions. For example, a generic schema for the movement of vehicles between locations is IN(V,X) => -IN(V,X), IN(V,Y), in which V can be instantiated with any vehicle and X and Y with any locations. Kautz and Selman [1996] and Ernst et al. [IJCAI 1997] have considered compact representations of schematic actions in the propositional logic. However, these works are restricted to sequential plans, and there is a lot of potential to improve the efficiency of planning further by considering some notion of parallel plans which allows the execution of several actions simultaneously.
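As a quick illustration of why compact representations matter, the following sketch naively grounds a hypothetical move schema over a small set of invented objects; the resulting blow-up in ground actions (and hence propositional variables per time step) is exactly what compact schematic encodings aim to avoid.

# Sketch of naive grounding of a schematic action, to show how the number of
# propositional variables grows.  The schema and object names are invented.
from itertools import product

vehicles = ["truck1", "truck2"]
locations = ["depot", "siteA", "siteB"]

ground_actions = []
for v, x, y in product(vehicles, locations, locations):
    if x == y:
        continue
    ground_actions.append({
        "name": f"move({v},{x},{y})",
        "pre":  [f"in({v},{x})"],
        "add":  [f"in({v},{y})"],
        "del":  [f"in({v},{x})"],
    })

print(len(ground_actions), "ground actions")   # |V| * |L| * (|L|-1) = 12
# In a SAT encoding, each ground action needs one propositional variable per
# time step; compact schematic encodings aim to avoid this blow-up.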
Honours project
The goal of the research project is to develop compact representations of schematic actions for parallel plans in the propositional logic. If the regularities in the actions described by an action schema are recognized, a more compact representation of the planning problem is possible, potentially leading to much more efficient planning. The work includes extending the ideas presented by Kautz and Selman and Ernst et al. to parallel plans as well as developing new techniques for compact action representation, and then implementing these techniques as a part of a planning system.
Supercom
Supervisor:
Dr Sylvie Thiébaux,
Dr Anbulagan,
Dr Olivier Buffet,
Dr Jinbo Huang,
Dr Yannick Pencolé
E-Mail:
sylvie.thiebaux-AT-anu.edu.au,
anbulagan-AT-nicta.edu.au,
olivier.buffet-AT-nicta.edu.au,
jinbo.huang-AT-nicta.edu.au,
yannick.pencole-AT-anu.edu.au
Background
Composite systems feature simple components organised into a highly reconfigurable and modular architecture, due to which they can appropriately assemble to achieve complex operations. Examples include web and grid services, energy distribution systems, water management systems, chemical plant control systems, telecommunication and computer networks, to name but a few.
In the "Supercom" project, we are interested in building Artificial Intelligence tools to supervise such composite systems. These will need to confer the system the ability to self-monitor (diagnose) and self-organise (assemble, reconfigure) to achieve optimal performance.
In this framework, various projects are available on topics such as decentralised diagnosis [1], decentralised planning [2] or diagnosis using satisfiability and knowledge compilation methods [3]. The main supervisor will depend on the project, but interactions with other researchers will arise naturally. A strong background in computer science, software engineering, or mathematics, and an interest in artificial intelligence are required, as well as good programming skills.
References:
[1] Yannick Pencolé and Marie-Odile Cordier, A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks, Artificial Intelligence, Volume 164, Number 1-2, May 2005. [W: http://www.informatik.uni-trier.de/%7Eley/db/journals/ai/ai164.html#PencoleC05]
[2] E. Amir and B. Engelhardt, Factored Planning, in 18th Int'l Joint Conference on Artificial Intelligence (IJCAI'03), 2003. [W: http://www.cs.uiuc.edu/~eyal/papers/factored-plans-ijcai03.pdf]
[3] Chu Min Li and Anbulagan, Look-Ahead Versus Look-Back for Satisfiability Problems, CP, 1997, Linz, Austria. [W: http://web.rsise.anu.edu.au/~anbu/papers/CP97Anbulagan.pdf]
SML_02: Go Opening Book: Learn from a million recorded games how to open a game of Go (up to 3 projects)
SML_04: The Smart Black Box: Build this little black box that will learn when to turn devices on and off for you.
SML_06: Implementing reinforcement learning algorithms in Python: Help build the world's best machine learning code repository!
SML_07: Representing Memory in Machine Learning?: Using Probabilistic State Representations for Partially Observable Markov Decision Processes
SML_08: Probabilistic temporal planning: Help us build a better Microsoft Project
SML_09: Training auto-pilots: Use reinforcement learning to train smart non-linear controllers for flying planes.
SML_11: Unwanted audio signal filtering: Fed up with listening to radio ads? Filter them out and show others how.
SML_12: Sport Signal Analysis: A tool to monitor athletes' efforts on various pieces of gym equipment.
SML_15: Python Package Manager: An open source package manager for installing and distributing Python packages.
SML_16: Nearest Neighbors: Implement an efficient nearest neighbor algorithm for the world's best machine learning library in Python.
SML_17: Decision Trees and Forests: Generate decision trees and forests in Python and C for CREST, our machine learning library.
SML_18: Processing of Face Images: Improve a face recognition system
Inferring structure from word processing documents
Supervisors: Dr Ian Barnes (DCS, ANU), Dr Peter Bailey (CSIRO), Dr David Hawking (CSIRO)
E-Mail: Ian.Barnes@anu.edu.au
E-Mail: Peter.Bailey@csiro.au
E-Mail: David.Hawking@csiro.au
Background
Word processing documents are basically flat lists of paragraphs, with various formatting information attached. There is no marking of important structural features like sections and subsections, title, abstract, lists and sublists. In order to do good information retrieval, it would be a big advantage to be able to deduce the structure from the formatting, and rewrite the document in a format where its structure is made explicit. There are various options available here depending on your interests; we will work to refine the project goals with you in the first weeks. Minimally it will involve some combination of developing software for generating and manipulating (probably XML-based) documents from word processing files, experimenting with different variations, and doing a rigorous assessment -- based on information retrieval evaluation methods -- of the value added to documents by the techniques you develop.
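One very simple way to picture the task is a stack-based pass that nests flat, styled paragraphs into sections. The sketch below is an assumption-laden illustration only (the style names and the XML output schema are invented), not a design the project would necessarily adopt.

# A minimal sketch of inferring nested section structure from a flat list of
# styled paragraphs, using a stack.  Paragraph styles and the output schema
# are assumptions for illustration.
import xml.etree.ElementTree as ET

# (style, text) pairs as they might come out of a word-processing file.
paragraphs = [
    ("Heading1", "Introduction"),
    ("Body",     "Some introductory text."),
    ("Heading2", "Motivation"),
    ("Body",     "Why structure matters for retrieval."),
    ("Heading1", "Method"),
    ("Body",     "How structure is inferred."),
]
heading_level = {"Heading1": 1, "Heading2": 2, "Heading3": 3}

root = ET.Element("document")
stack = [(0, root)]                      # (level, element) pairs
for style, text in paragraphs:
    if style in heading_level:
        level = heading_level[style]
        while stack[-1][0] >= level:     # close deeper or equal sections
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section", title=text)
        stack.append((level, section))
    else:
        ET.SubElement(stack[-1][1], "p").text = text

print(ET.tostring(root, encoding="unicode"))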
RFID implementation and privacy
Supervisors: Dr Michael Sheng (CSIRO), Mr Tom Worthington (DCS, ANU) and A/Prof Chris Johnson (DCS, ANU)
E-Mail: Michael.Sheng@csiro.au
E-Mail: Tom.Worthington@anu.edu.au
E-Mail: cwj@cs.anu.edu.au
Background
Radio Frequency Identification (RFID) technology uses radio-frequency communication to transfer data between tagged objects and readers, creating the possibility of a physically linked world where every object is numbered, identified, catalogued, and tracked. There are many innovative and high-payoff RFID applications, such as tracking doctors/patients in hospitals to improve efficiency, marking airline bags to reduce lost bags, and tracking people in refugee camps to ease post-disaster relief (e.g., helping people find their missing relatives). Currently, both academia and major industrial players like MIT, Stanford, SAP, Microsoft, and Sun are investing heavily in R&D to enable RFID to realize its potential. This project centres around the investigation and implementation of RFID technologies. In particular, we will first build an experimental environment using RFID tags and readers and perform RFID data translation and transformation. We will also study the RFID standards, including the EPCglobal network. The final stage of the project is the design and implementation of an RFID application for tracking people in refugee camps based on the open source Sahana (http://www.sahana.lk/). Subject to the student's interests, there is also a chance to investigate the privacy issues associated with this application. This project presents an exciting opportunity for the student to obtain sound and fundamental knowledge of this hot area and gain research skills and experience. The student will work with researchers from the CSIRO ICT Centre and ANU. Skill Prerequisites: No prior knowledge of RFID is required, but good programming skills in Java are essential. Knowledge of databases (e.g., Oracle), XML, and Web services is a plus.
Tableaux reasoning for web service composition
Supervisors: Dr Mark Cameron (CSIRO) A/Prof Rajeev Gore (RSISE, ANU)
E-Mail: Mark.Cameron@csiro.au
E-Mail: Rajeev.Gore@anu.edu.au
Background
The Information Engineering Lab has a logic-based service composition framework that is being applied across a number of domains, including Health Data Integration (Cameron et al, http://cips.eas.asu.edu/iiwebfinalproceedings/31.pdf) and environmental management (http://www.iemss.org/iemss2002/proceedings/pdf/volume%20uno/304_cameron.pdf). A published method for web service composition (Berardi et al 2004, http://www.inf.unibz.it/~calvanese/papers-html/ICSOC-2004.html, http://www.inf.unibz.it/~calvanese/papers-html/TES-2004.html; Berardi et al 2005, http://www.inf.unibz.it/~calvanese/papers-html/VLDB-2005.html) relies on a translation of finite-state machine representations of a composition goal and stateful Web services into a modal/description logic, PDL. Tableaux theorem proving can then be applied to find a web service composition that satisfies the goal. This project would aim to replicate and implement that basic approach, but considering alternative logics to improve the scope of the compositions that may be achieved.
Capability modelling and reasoning for record-linking web-services
Supervisors: Dr Mark Cameron (CSIRO) Dr Peter Christen (DCS, ANU)
E-Mail: Mark.Cameron@csiro.au
E-Mail: Peter.Christen@anu.edu.au
Background
The Information Engineering Lab has a logic-based service composition framework that is being applied across a number of domains, including Health Data Integration (Cameron et al, http://cips.eas.asu.edu/iiwebfinalproceedings/31.pdf) and environmental management (http://www.iemss.org/iemss2002/proceedings/pdf/volume%20uno/304_cameron.pdf). Statistical record linkage is used to match similar but differently-represented information from heterogeneous information sources. It is most commonly used to match names and addresses of people represented in independent databases. Capability modelling and reasoning has been applied in mediation systems built over relational databases (Vassalos and Papakonstantinou, http://www.db.ucsd.edu/publications/vpcap2.pdf). The aim of this project is to replicate and extend that work to operate over a record linking web service system that exposes its functionality at a fine grain.
Eye-gaze tracking as an interface for virtual environments
Supervisors: Chris Gunn (CSIRO), Henry Gardner (DCS, ANU)
E-Mail:
Chris.Gunn@csiro.au
Henry.Gardner@anu.edu.au
Background
This project will seek to characterise the effectiveness of eye-gaze tracking as a human-computer interface for virtual environments. The concept is that participants should be able to interact with projected images using a combination of keyboard controls and eye gaze. By focussing their visual attention on icons or images on the screen, users should be able to select these icons or activate other interaction. Two application areas of interest are the Mind Attention Interface for the Wedge virtual reality theatre and a something-or-other system in which participants stand in front of a half-spherical shell screen. In the Mind Attention Interface, eye-gaze tracking equipment will be integrated with electroencephalogram (EEG) equipment which measures electric signals from the brain of a participant. Correlating eye-gaze measurements of attention with EEG data would be a possible extension of this project.
Blind Data Linkage
Supervisor: Peter Christen
E-Mail:
Peter.Christen@anu.edu.au
Background
Integrating or linking data from different sources is increasingly an important task in the preprocessing stage of many data mining projects. The aim of such data linkage is to merge all records relating to the same entity, such as a patient or a customer. If no common unique entity identifiers (keys) are available in all data sources, the linkage needs to be performed using the available identifying attributes, like names and addresses. Data confidentiality often limits or even prohibits successful data linkage, as either no consent can be gained (for example in biomedical studies) or the data holders are not willing to openly provide their data.
- A medical research project, for example, may need to determine which individuals, whose identities are recorded in a population-based register of people suffering from hepatitis C infection, also appear in a separate population-based register of cases of liver cancer. Clearly such linkage between these two databases involves the invasion of privacy of the individuals whose details are stored in these databases. It is usually infeasible to obtain consent from each individual identified in each of the register databases - instead one or more ethics committees or institutional review boards provide consent for the linkage of the two databases on behalf of the individuals involved, after weighing the public good which is expected to accrue from the research against the unavoidable invasion of individual privacy which the research involves.
- Similarly, consider two research databases of genetic sequences of DNA, each of which are commercially valuable in their own right. The owners or custodians of these two databases may wish to determine whether their databases contain any sequences in common before deciding to collaborate and share their data, without having to first reveal the contents of their database to the other party, or to a third party.
The ANU Data Mining Group is currently working in collaboration with the NSW Department of Health, Centre for Epidemiology and Research on the development of blind data linkage methods (i.e. methods where data linkage can be performed between different parties without the disclosure of any personal or valuable information). To achieve this goal, we are using various techniques from cryptology and related areas. For more details please have a look at the recent publication Some methods for blindfolded record linkage.
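As a hedged illustration of one ingredient of such techniques, the sketch below applies a keyed hash (HMAC) to normalised identifiers so that two parties can compare values for equality without revealing them to a third party. The key handling and matching here are greatly simplified compared with the protocols described in the publication above, which also support approximate matching of misspelt names.

# Hedged sketch of one ingredient of blind linkage: both parties apply a
# keyed hash (HMAC) to normalised identifiers using a shared secret key, so
# values can be compared for equality without being revealed.
import hmac, hashlib

SECRET_KEY = b"shared-secret-agreed-out-of-band"   # assumption for the demo

def blind(value: str) -> str:
    """Normalise an identifier and return its keyed hash."""
    normalised = " ".join(value.lower().split())
    return hmac.new(SECRET_KEY, normalised.encode(), hashlib.sha256).hexdigest()

# Each data holder hashes its own records locally ...
hospital_a = {blind(name) for name in ["John Smith", "Mary  Jones"]}
hospital_b = {blind(name) for name in ["mary jones", "Peter Brown"]}

# ... and only the hashes are compared (here, a simple set intersection).
print(len(hospital_a & hospital_b), "records in common")   # 1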
Honours student research project
The aim of this honours research project is to further develop these blind data linkage techniques. So far we have implemented simple proof-of-concept programs written in Python. Performance issues with large (real world) databases need to be addressed, as well as distributed services and Internet connectivity. One aim is to integrate blind data linkage into our open source data linkage system Febrl.
A student working on this project would be involved in:
- Participation in an exciting research project in collaboration with an industry partner.
- Exploration and development of novel privacy preserving data linkage methods using algorithms and techniques from cryptography, machine learning, statistics, indexing, data mining, and distributed and remote computing.
- Object-oriented cross-platform programming (using Python and friends).
- Open source software development (see http://sourceforge.net/projects/febrl).
For more information, publications, and further links please visit the project web page at http://datamining.anu.edu.au/linkage.html.
Internet Futures
Supervisor: Dr Markus Buchhorn
E-Mail:
Markus.Buchhorn@anu.edu.au
Background
Our group (Internet Futures) has the following Honours project proposals on offer:
1. Cluster Transcoding:
This project will explore the use of a cluster of commodity cpus to handle multi-stream, multi-source video transcoding from an arbitrary range of sources and to an arbitrary range of outputs. This would include video decoding and encoding at high performance and low latency, plus stream merging to support small-screen devices. The environment for this project is the AccessGrid (www.accessgrid.org), with targets including small-screen, low-cpu-performance, and/or wirelessly connected systems.
2. Audio processing for enhanced user interfaces in the AccessGrid
This project could have two components, or be seen as two projects. One is to develop a system for voice control of the AccessGrid, including its rich shared application space and its complex user interface. The other project would use audio information from an AccessGrid audio system to identify the location of a speaker in the room and control cameras to target the speaker.
3. Voice to Text Processing:
Again in the context of the AccessGrid, this project will use a real-time speech-to-text engine to listen to the audio channels from multiple sites and use the generated text for a variety of purposes: rendering text onto video tiles as captions, and automatic transcription for meeting minutes (with information about where the audio came from, etc.). This will require some audio processing to generate a clean signal, and performance analysis to achieve this with low latency.
4. Middleware for grid and web services
This project will use the open standards SAML and XACML, and some implementations such as Shibboleth, to build a globally scalable authentication and authorisation system. While this exists in part for web-based transactions, it does not exist for various grid and web services (submitting jobs to supercomputers), nor for database transactions such as SQL, and these are becoming vital. There are also performance issues with these systems in web portals such as JSR-168-based portals.
Performance Instrumentation for the SCASH Distributed Shared Memory System
Supervisors: Dr Peter Strazdins and Alistair Rendell, with assistance from Jin Wong (PhD student)
A Beowulf-style cluster computer is a parallel computer using a Commercial-off-the-Shelf switch-based network to communicate between the processors. Clusters have proved a highly cost-effective high performance computer model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks. They have even competed with the traditional symmetric multiprocessor (SMP) server in some applications, and support for threaded programming models, based on distributed shared memory (DSM) middleware, is also available on them. This permits high-level programming paradigms, such as OpenMP, to be supported on clusters. An example of this is the OMNI OpenMP compiler and the SCASH (SCore-based) SDSM.
Due to their highly mature performance instrumentation infrastructure (based on performance event counting libraries), performance evaluation methodologies have been largely successful for improving application performance and architectural design for SMPs.
This project will develop and evaluate performance instrumentation infrastructure for SCASH analogous to the performance event counting libraries on SMPs. This can be done by finding DSM analogies to SMP events, such as external cache misses/invalidations/write-backs and local/remote memory bank accesses, and adding code to the SCASH software to count these events and their associated stall cycles. Note that Jin Wong is currently working on SCASH and is an expert on its internals. This infrastructure will allow the application of performance evaluation methodologies for OpenMP programs on SMPs to be extended to OpenMP-enabled clusters.
References
See the links above, and also:
- the IEEE Task Force on Cluster Computing home page.
- Performance event counting libraries for SMPs: Sun CPC library and the portable PAPI library.
- Jin Wong, OpenMP for Clusters, Honours thesis, DCS ANU, November 2004, (see Peter or Jin for a copy)
- Zarka Cvetanovic, Performance Analysis Tools for Large-Scale Linux Clusters, Proceedings of the 2004 IEEE International Conference on Cluster Computing, 2004 (see Peter for a copy)
Multithreading Issues on the OpenPower 720 Multiprocessor
Supervisor: Dr Peter Strazdins
A state-of-the-art OpenPower 720 Multiprocessor has recently been installed at DCS, courtesy of the OzLabs branch of the IBM Linux Technology Centre. As a shared memory multiprocessor, it is an interesting system:
- a 2-way SMP (symmetric multiprocessor), i.e. two independent processors sharing a physical address space, with a coherency protocol implemented by the bottom-level caches.
- 2 CPU `cores' per SMP node, i.e. two CPUs sharing caches
- 2-way symmetric multithreading (SMT) per core, i.e. 2 threads (sharing the same address space) can execute simultaneously on each core; these threads have different register sets but share all other resources (including the CPU). The CPU can thus switch rapidly between the threads, e.g. upon a cache miss
Such a multi-level processor opens interesting possibilities in running multiple threads. The following themes (which could be expanded to projects in their own right) thus emerge:
- investigation and evaluation of fast synchronisation algorithms, and `lockless' data structures.
- evaluate the system on scientific benchmarks, including the OpenMP Parallel Benchmarks and the STREAM benchmarks. Here issues such as the effectiveness of SMT and non-uniform memory access effects are of particular interest.
These projects are of strategic interest not only to DCS but also to OzLabs.
References
See the links above, and also:
- Maurice Herlihy et al, Dynamic-sized Lock-free Data Structures, Sun Microsystems Technical Report TR-2002-110 (for theme 1)
- SMT and use of multiple cores are related to the idea of chip multithreading (CMT).
Communication Configurations for SMP Clusters -or- Xen and the Art of Multiple Interface Cards
Supervisor:
Dr Peter Strazdins
client: Richard Alexander, CEO,
Alexander Technology
A Beowulf-style cluster computer is a parallel computer using Commercial-off-the-Shelf switch-based network to communicate between the processors. The ANU Beowulf cluster Bunyip is such a cluster based on Fast (100Mb) Ethernet switches. Clusters have proved a highly cost-effective high performance computer model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks.
Xen is a technology providing virtualization of operating system services, emerging as one of the hottest topics in operating systems R&D. It permits, for example, multiple hosts to run on an SMP (multiple CPU) machine, binding an independent operating system to each CPU.
Linux clusters with SMP (multiple CPU) nodes and interconnects such as Gigabit Ethernet may be configured with multiple Network Interface Cards (NICs). A question which naturally arises is how to optimally configure the hardware (NICs) and the software which drives them. Alternatives range from the CPUs sharing channel-bonded NICs through to dedicating a NIC to a particular CPU. In the latter case, technologies such as Xen can be utilized. In both cases TCP/IP stack overhead is a concern. This project will evaluate these configurations, measuring performance via the MPI Message Passing library, using contemporary experimental cluster hardware provided by Alexander Technology.
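A rough sketch of the kind of micro-benchmark such an evaluation might start from is shown below: an MPI ping-pong between two processes, written here with mpi4py. The message size, repeat count and reporting are arbitrary choices, not part of the project specification.

# Rough sketch: MPI ping-pong between two processes to measure round-trip
# time and bandwidth for a given message size.
# Run with e.g.  mpirun -np 2 python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nbytes = 1 << 20                      # 1 MB message (arbitrary choice)
reps = 50
buf = np.zeros(nbytes, dtype=np.uint8)

comm.Barrier()
start = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    rtt = elapsed / reps
    print(f"round trip: {rtt*1e6:.1f} us, "
          f"bandwidth ~ {2 * nbytes / rtt / 1e6:.1f} MB/s")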
These projects are of strategic interest not only to DCS but also to Alexander Technology, which is rapidly expanding its research and development in cluster technology.
References
See the links above, and also:
- the IEEE Task Force on Cluster Computing home page.
Performance Modelling of Parallel Programs on SMP Clusters
Supervisor: Dr Peter Strazdins
A cluster computer such as the ANU Beowulf is a parallel computer using (typically Commercial-off-the-Shelf) switch-based network to communicate between the processors. These have proved a highly cost-effective high performance computer model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks.
Exploiting the price-performance `sweetspot' even more finely, clusters whose nodes are made of low-end (2 or 4 CPU) SMPs have become popular. The Beowulf has 2-way SMPs; the Sunnyvale cluster at DCS has 4-way SMPs.
It is possible to model the execution time of a parallel application using simple algebraic equations. For example, the time to send a message of n words can be modelled by:
t(n) = a + b * n
Similarly, the execution time of a loop of n iterations can be modelled by an equation of a similar form. In this way, one can model the execution time of computations such as parallel LU decomposition, first described in LAPACK working note 80.
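As a small worked example of using this model, the sketch below fits the parameters a (latency) and b (per-word cost) to a handful of invented message-timing measurements with a least-squares line fit.

# Small sketch of fitting the model t(n) = a + b*n to measured message times;
# the timing numbers below are invented demo values, not real measurements.
import numpy as np

sizes = np.array([1e3, 1e4, 1e5, 1e6])                 # message sizes n (words)
times = np.array([0.00012, 0.00030, 0.0021, 0.020])    # measured times (s)

b, a = np.polyfit(sizes, times, 1)                     # slope b, intercept a
print(f"latency a ~ {a*1e6:.1f} us, per-word cost b ~ {b*1e9:.2f} ns/word")
print("predicted t(5e5 words):", a + b * 5e5, "s")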
The above techniques worked well for clusters and massively parallel computers with one CPU per node. Parallel applications on SMP clusters, however, are typically run with one process allocated to each CPU, so that there are multiple processes per cluster node. Here processes on the same node communicate via a shared memory communication transport, at a considerably faster rate than messages over the network. On the other hand, processes on node A may simultaneously try to send messages to processes on node B; in this case message speed is slowed down due to contention over the single link between A and B.
Thus, modelling the execution time of a communication operation such as a broadcast amongst a subset of these processes is non-trivial, as it may have to take into account both of these effects.
This project will investigate how to model execution times on SMP clusters, building up from communication patterns to `regular' applications, such as dense linear systems solution. Probabilistic techniques may be required to model the effects of contention. The ANU Beowulf and the SC will form suitable experimental platforms. The project has the potential for publishable work.
References
See the links above, and also:
- The IEEE Task Force on Cluster Computing home page.
- The ANU Beowulf home page.
- Wi Bing Tan's Honours thesis - some earlier work on communication patterns on the AP3000, and one on dense linear systems solution on the Beowulf. The problems with modelling execution time on the Beowulf emerged in a recent paper on job scheduling.
Extended Threads Emulation in an SMP Computer Simulator
Supervisor: Dr Peter Strazdins and other members of the Sparc-Sulima team.
Simulation is playing an increasingly important role in the design and evaluation of current and future high performance computer systems. Sparc-Sulima is an UltraSPARC SMP full-machine simulator currently being developed by the CC-NUMA Project (before that by the ANU-Fujitsu CAP Project). Sparc-Sulima simulates an UltraSPARC processor by interpreting each instruction.
Currently, Sparc-Sulima can simulate an unmodified Solaris (UltraSPARC) binary (executable program). When that program executes a system call trap instruction, the corresponding operating system effects are `emulated' (the simulated computer state is updated directly, rather than via simulating a lengthy and complex sequence of kernel-level instructions), and simulated execution then resumes. This mechanism is handled by the SolEmn emulation layer in Sparc-Sulima. Threaded programs (e.g. using the pthreads POSIX thread libraries) ultimately access thread-related operating system services via the Solaris _lwp (lightweight process) trap instructions. The limitation of the current implementation is that, at any time, there can only be one active thread per CPU. Such a limitation is also present in all emulation-based computer simulators in the literature.
The aim of this project is to extend the emulation of the _lwp trap instructions to remove this limitation. It will require implementing thread switching, which can occur by time-outs, and also by the _lwp_suspend and yield system calls. The current thread context save/restore mechanisms will also have to be extended. If successful, the work should be publishable.
An extension of this project, more implementation oriented, would be to simulate, rather than emulate, the _lwp traps. This would involve writing a mini-kernel that would perform an implementation of lightweight threads. Now, when Sparc-Sulima executes a threaded program and encounters an _lwp trap, it will start executing the kernel, just like on a real machine. Note that the same underlying algorithms of the emulation mode can be used; however, some consideration must now be given to the `locking' of shared data structures. The advantage of this is that kernel-level (notably memory system-related) overheads can now be recorded by the simulator. Thus, the simulator will be able to predict more accurately the threaded program's overheads.
This project is related to the project Using Threading Techniques to speed up SMP Computer Simulation. It is part of the CC-NUMA Project.
References
- Slides for a background talk for Sparc-Sulima.
- Pthreads programming by O'Reilly, and other books on POSIX threads. Solaris on-line documentation on Solaris threads, including man -s 2 intro.
- from the Sparc-Sulima home page, there is a survey paper on machine simulation (citing many more references) and also papers on the implementation of Sparc-Sulima and SolEmn. There are manuals and source codes too.
Parallelizing Applications on a Heterogeneous Cluster
Supervisors: Dr Peter Strazdins, Dr Alistair Rendell, and Mr Richard Alexander
A Beowulf-style cluster computer is a parallel computer using Commercial-off-the-Shelf switch-based network to communicate between the processors. The ANU Beowulf cluster Bunyip is such a cluster based on Fast (100Mb) Ethernet switches. Clusters have proved a highly cost-effective high performance computer model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks.
A typical cluster such as Bunyip tends to evolve as new processors and networks (and the monies to pay for them!) become available. Thus clusters tend to become heterogeneous. Furthermore, for a particular application, there tends to be a particular cluster configuration that is optimal. In the case of a (high performance) computing service centre, designed to serve many such applications, a cluster may be designed to be heterogeneous from the beginning.
The efficient parallelization of applications using the Message Passing Interface (MPI) is reasonably well understood for homogeneous clusters. However, for heterogeneous clusters, which can not only have nodes with different processor architectures but also sections of nodes connected by different networks, the situation is far more challenging. In particular, differences in processor and network speeds make it difficult to utilise the faster components effectively.
In partnership with the local company Alexander Technology, who will install in 2006 a heterogeneous cluster in DCS called Jabberwocky, this project will investigate techniques used so far to parallelize (scientific/engineering) applications for heterogeneous clusters. The project will then make a selection of these applications and evaluate the effectiveness of these techniques on the Jabberwocky. New or variant techniques may then be proposed and evaluated.
This project is of strategic interest not only to DCS but also to Alexander Technology, which is rapidly expanding its research and development in cluster technology. Scholarships may be available for suitably qualified applicants.
References
See the links above, and also:
- the IEEE Task Force on Cluster Computing home page.
- An Overview of Heterogeneous High Performance and Grid Computing, Jack Dongarra and Alexey Lastovetsky
- more info on the Jabberwocky can be found on the door of the cluster machine room (next to CSIT N331); some of the sub-clusters that will form the Jabberwocky are already there
Characterization of Application Performance of Cluster Computers
Supervisors: Dr Peter Strazdins and Mr Richard Alexander
A Beowulf-style cluster computer is a parallel computer using Commercial-off-the-Shelf switch-based network to communicate between the processors. The ANU Beowulf cluster Bunyip is such a cluster based on Fast (100Mb) Ethernet switches. Clusters have proved a highly cost-effective high performance computer model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks. However, the COTS networks' raw communication speed has not kept up with the dramatic increases in processor speed, and provides a limit to the performance of many applications on clusters.
Applications written for clusters use the Message Passing Interface (MPI) to both synchronize and communicate data between their processes running on different cluster nodes. Costs of communication include both latency (the time to send a small message) and bandwidth (the rate of data transfer, relevant for larger messages).
An important aspect of research into clusters is then the characterization of applications' performance: at the lowest level, is an application bound by computational resources, memory access, latency or bandwidth? For applications dominated by communication, finer questions can be addressed, such as whether the application can benefit from networks supporting bi-directional communication, and what its potential is for overlapping communication with computation. This information can be obtained from the literature (on the applications and benchmarks concerned), study of the application's source code, and by experimentation.
In partnership with the local company Alexander Technology, this project will investigate this issue. It has a natural synergy with a sister project being undertaken by an MIT student in semester 1 2006. Together, these projects will develop a Performance Evaluation Methodology targeted at the Jabberwocky multicluster, which will be installed by Alexander Technology in DCS in early 2006.
This project is of strategic interest not only to DCS but also to Alexander Technology, which is rapidly expanding its research and development in cluster technology. Scholarships may be available for suitably qualified applicants.
References
See the links above, and also:
- the IEEE Task Force on Cluster Computing home page.
- the HPC Challenge Benchmark is a cluster benchmark suite, with individual benchmarks chosen to exhibit a variety of performance characterizations
- more info on the Jabberwocky can be found on the door of the cluster machine room (next to CSIT N331); some of the sub-clusters that will form the Jabberwocky are already there
Energy-Efficient Data Gathering in Wireless Sensor Networks
Supervisor: Dr Weifa Liang
E-Mail: wliang@cs.anu.edu.au
Background
Wireless sensor networks have received significant recent attention due to their potential applications in civilian and military domains, including surveillance, fine-grain monitoring of natural habitats, and measuring variations in local salinity levels in riparian environments. The sensors in sensor networks periodically collect sensed data as they monitor their vicinities for a specific interest.
The sensed data can be transmitted directly to the base station where client queries are posted; the base station then processes the collected data centrally and responds to users' queries. However, because each sensor is equipped with an energy-limited battery, energy efficiency rather than the response time for each query is of paramount importance in sensor networks. The energy consumption of such centralized processing is very high. Instead, we adopt energy-efficient in-network processing, i.e., the data gathering is implemented through a tree rooted at the base station, and the data at each sensor has to be filtered or combined locally before it is transmitted to its parent.
Unlike most existing work, which assumes that each sensor transmits only a fixed-length message, we consider another data gathering model, in which the length of the message that each sensor transmits to its parent is proportional to the sum of the message lengths of its children and itself. Thus, the total energy consumption at a node is proportional to the length of the transmitted message and the distance between the node and its parent in the tree. Specifically, on one hand, we aim to find an energy-efficient spanning tree rooted at the sink node such that the network lifetime is maximized. On the other hand, we prefer an approximate result rather than the exact result of traditional data models for a given query, since communications among the sensors are failure-prone. Under this circumstance, how close the approximate result is to the exact result becomes the essential issue.
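The energy model can be illustrated with a toy computation over an invented topology and a candidate gathering tree. The sketch below is an assumption-laden illustration only, not the algorithms the project would develop.

# Toy sketch of the energy model described above: each node forwards a
# message whose length is the sum of its children's message lengths plus its
# own reading, and transmission energy is proportional to message length
# times the distance to the parent.  Topology and coordinates are invented.
import math

# node -> (x, y); node 0 is the base station (sink).
pos = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (1, 1), 4: (2, 1)}
parent = {1: 0, 2: 1, 3: 0, 4: 3}          # a candidate gathering tree
children = {n: [c for c, p in parent.items() if p == n] for n in pos}

def message_length(node):
    """Own unit reading plus everything forwarded by the subtree."""
    return 1 + sum(message_length(c) for c in children[node])

def tx_energy(node):
    px, py = pos[parent[node]]
    x, y = pos[node]
    return message_length(node) * math.hypot(x - px, y - py)

energies = {n: tx_energy(n) for n in parent}
print("per-node energy:", energies)
print("bottleneck node (limits network lifetime):", max(energies, key=energies.get))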
In summary, this project will pursue energy efficiency on the basis of database modelling in sensor network environments. A student working on this project would be involved in:
- participating in the research project.
- studying the problem complexity when taking the network lifetime as the sole metric.
- proposing heuristic algorithms for finding a data gathering tree and providing approximate aggregate results for a given aggregate query.
- conducting experiments by simulation to validate the efficiency of the proposed algorithms.
Neural networks theory and applications
Supervisor: Prof Tom Gedeon
E-Mail: tom@cs.anu.edu.au
Background
A number of projects are possible in this area. I am interested in a number of topics, such as extracting rules from neural networks, information retrieval using neural networks, data mining and feature selection, cascade neural network structures, hierarchical neural network structures, and neural network applications. I have published papers in all of these areas with former students so there is plenty of earlier work to build on. No previous experience with neural networks is necessary. Most projects will use the very popular backpropagation neural network training algorithm. If you are interested in neural networks and would like to see if you might want to do a project in this area, please e-mail me tom@cs.anu.edu.au or come and see me.
Example projects: i. Cascade neural network structures can be built automatically, without making decisions about the number of neurons required to solve a problem, by adding single neurons. This project would investigate the use of larger chunks such as feature maps as cascade components, which would be useful for recognising images (including faces).
ii. Rule extraction: Neural networks can learn complex tasks but face the problem that human beings do not trust them as they can not understand _why_ a particular decision is made. This project focuses on rule extraction for explanation.
Fuzzy logic theory and applications
Supervisor: Prof Tom Gedeon
E-Mail: tom@cs.anu.edu.au
Background
A number of projects are possible in this area. I am interested in a number of topics, such as automated construction of fuzzy rule bases from data, hierarchical fuzzy systems, fuzzy interpolation, information retrieval using fuzzy logic, universal approximators, and fuzzy logic applications. I have published papers in all of these areas with former students so there is plenty of earlier work to build on. No previous experience with fuzzy logic is necessary. If you are interested in fuzzy logic and would like to see if you might want to do a project in this area, please e-mail me tom@cs.anu.edu.au or come and see me.
Example projects: i. Fuzzy document filtering: Web search engines return many documents which are not relevant to queries. Fuzzy techniques to enhance the results from search engines would be very useful. This project will investigate a number of fuzzy techniques for this purpose.
ii. Combining uncertainty: Fuzzy logic is a technique which deals with uncertainty well. Sometimes data contains multiple kinds or sources of uncertainty. For example, a pedestrian could report that he saw someone he was quite sure was stealing something: he was not certain, and his own reliability is another, different kind of uncertainty. This project is to develop some methods to combine different kinds of uncertainty in intelligence-led investigation. (Data from a company specialising in this area will be available.)
Bacterial evolutionary algorithms
Supervisor: Prof Tom Gedeon
E-Mail: tom@cs.anu.edu.au
Background
There are several optimisation methods inspired by natural processes. It has been shown that evolutionary algorithms are efficient tools for solving non-linear, multi-objective and constrained optimisation problems. The principle is to search with a population of solutions, where tuning is done using mechanisms similar to biological recombination. The operations of the bacterial evolutionary algorithm were inspired by microbial evolution: bacteria can transfer genes to other bacteria, so the rules are optimised in each bacterium.
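As a hedged illustration of the two core operations, the sketch below applies bacterial mutation and gene transfer to a toy minimisation problem. All parameter choices are arbitrary and the fitness function is invented purely for demonstration; it is not tuned for any real application.

# Hedged sketch of the two core operations of a bacterial evolutionary
# algorithm -- bacterial mutation and gene transfer -- on a toy problem
# (minimising the sum of squares of a real vector).
import random

GENES, POP, GENERATIONS, CLONES = 8, 10, 50, 4

def fitness(b):                      # lower is better
    return sum(x * x for x in b)

def bacterial_mutation(b):
    """Clone the bacterium, mutate one gene in each clone, keep the best."""
    for i in range(GENES):
        clones = [b[:] for _ in range(CLONES)]
        for c in clones:
            c[i] = random.uniform(-1, 1)
        b = min(clones + [b], key=fitness)
    return b

def gene_transfer(pop):
    """Copy a random gene from a better bacterium into a worse one."""
    pop.sort(key=fitness)
    good, bad = random.choice(pop[:POP // 2]), random.choice(pop[POP // 2:])
    i = random.randrange(GENES)
    bad[i] = good[i]

population = [[random.uniform(-1, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [bacterial_mutation(b) for b in population]
    gene_transfer(population)
print("best fitness:", fitness(min(population, key=fitness)))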
Example projects:
i. Clustering using BEAs: Clustering is more usually performed by
gradient descent algorithms. BEAs should allow more interesting
clusters to be formed where there are complex evaluation criteria
for cluster validity. Objects clustered could be images (e.g. see
1.i or 4.ii), or text (e.g. see 2.i, 2.ii). Clusters could have
predefined target criteria, such as hierarchical structure of a
certain depth.
ii. Making pictures: If bacteria encoded the components of an abstract computer-generated picture, an individual could repeatedly mark images as nice or not nice to generate some 'art' tuned to their aesthetic sense.
iii. Scheduling: A previous student has worked on a GA for University exam scheduling. A comparison with bacterial algorithms would be an interesting project.
Face Recognition
Supervisor: Prof Tom Gedeon
E-Mail: tom@cs.anu.edu.au
Background
A number of projects are possible in this area. I am interested in a number of topics, such as image processing for face recognition, HCI research tool to collect facial images, biologically plausible architectures for face recognition, building and modifying computer models of real faces. No previous work on face recognition is necessary. For the image processing topic, some experience with computer graphics would be useful. If you are interested in face recognition or some similar topic and would like to see if you might want to do a project in this area, please e-mail me tom@cs.anu.edu.au or come and see me.
Example projects:
i. View morphing: View morphing between two images of an object taken
from two different viewpoints produces the illusion of physically
moving a virtual camera. We generate compelling 2D transitions
between images. Initially we will use the same face with small
rotations and generate intermediate images to produce a realistic
QuickTimeVR head.
ii. Video annotation: Techniques such as PCA can recognise the same face so long as the rotation/expression/illumination is not too different (see the sketch after this list). This can be used to recognise one person throughout a segment of video.
iii. 3-D Face model: An average face 3D model can be constructed from a single face image and animated via rotation, a simple face expression model, or illumination. This can then be used to recognise faces in multiple poses, expressions or lighting.
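A minimal eigenfaces-style sketch of the PCA matching mentioned in example ii, assuming the faces have already been detected, aligned and flattened into vectors; the function names and component count are illustrative only:

```python
import numpy as np

def fit_pca(train_faces, n_components=20):
    # train_faces: one flattened face image per row
    mean = train_faces.mean(axis=0)
    centred = train_faces - mean
    # SVD of the centred data gives the principal components ("eigenfaces").
    _, _, components = np.linalg.svd(centred, full_matrices=False)
    return mean, components[:n_components]

def project(face, mean, components):
    return components @ (face - mean)

def identify(face, mean, components, gallery):
    # gallery: list of (label, projected_face) pairs from earlier video frames
    coords = project(face, mean, components)
    return min(gallery, key=lambda item: np.linalg.norm(item[1] - coords))[0]
```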
Probabilistic Temporal Planning
Supervisors: Dr Doug Aberdeen, Dr Sylvie Thiébaux, Dr Olivier Buffet
E-Mail: doug.aberdeen@anu.edu.au,
sylvie.thiebaux@anu.edu.au,
olivier.buffet-AT-nicta.com.au
Background
Planning is the process of finding which sequence of decisions to
take so as to achieve a given goal. We are currently working on a large
planning domain aiming at scheduling tasks with probabilistic outcomes
[1]. To our knowledge, the tool that has been developed is the first
to handle time, resources and probabilities at the same time.
Various questions can be addressed in this framework. Improving the
planning algorithms is a major issue, whether in the framework of
Markov Decision Processes or of more classical planning. Other issues,
such as plan visualisation or human-machine interaction, are also of
interest, as the plan should be easy to understand and the user may
need to give feedback to the software (by setting new constraints for
planning).
There are several sub-projects available. We highly encourage
interested students to come and talk to us to discuss specific
possibilities. Software used in this project may end up being used
by the Defence Science and Technology Organisation.
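For orientation, a minimal value-iteration sketch over a tabular Markov Decision Process is shown below; the toy transition table, rewards and discount factor are invented and bear no relation to the actual planning domain of [1].

```python
# Value iteration on a tiny hand-coded MDP (illustrative assumptions only).
transitions = {   # state -> action -> list of (probability, next_state, reward)
    "start": {"act":  [(0.8, "done", 10.0), (0.2, "start", -1.0)],
              "wait": [(1.0, "start", -0.1)]},
    "done":  {"act":  [(1.0, "done", 0.0)],
              "wait": [(1.0, "done", 0.0)]},
}
gamma = 0.95
values = {s: 0.0 for s in transitions}

for _ in range(100):   # repeatedly back up the expected value of the best action
    for s, actions in transitions.items():
        values[s] = max(sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                        for outcomes in actions.values())

policy = {}
for s, actions in transitions.items():
    policy[s] = max(actions, key=lambda a: sum(p * (r + gamma * values[s2])
                                               for p, s2, r in actions[a]))
print(values, policy)
```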
References
[1] D. Aberdeen and S. Thiébaux and L. Zhang,
"Decision-Theoretic Military Operations Planning", in Proceedings of
the Fourteenth International Conference on Automated Planning and
Scheduling (ICAPS'04), Whistler, Canada, June 2004.
Automatic Hierarchy Discovery
Supervisor: Dr Olivier Buffet
E-Mail: olivier.buffet-AT-nicta.com.au
Background
Reinforcement Learning is the problem of learning a policy (how to act
in various situations) in order to maximise a payoff (which describes a
goal to achieve). Since we consider probabilistic environments, we work
in the framework of Markov Decision Processes (MDPs) [1], where an action
may lead to different outcomes. As state spaces may be very large, it
is of major interest to consider a given problem as a hierarchy of
problems ("divide and conquer" approach). Unfortunately, if various
works have led to efficient algorithms exploiting such a hierarchical
structure, automatically discovering a hierarchy remains a very
difficult task.
A recent work [2] has proposed a first approach to automatic hierarchy
discovery, based on analysing which observed variables seem to be the
most dependent on the decisions made. However, this algorithm relies on
an appropriate choice of variables. The main concern of this project is
therefore to find how to remove this dependence on the choice of
variables, with the system able to create "artificial variables" if they
seem more appropriate than the original ones.
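As a very rough illustration (this is not the HEXQ algorithm of [2]), the sketch below gathers per-variable change statistics from random exploration, which is the kind of signal such hierarchy-discovery methods analyse; the env interface (reset() and step() returning state tuples) is an assumption for the example.

```python
from collections import defaultdict
import random

def variable_change_frequencies(env, actions, episodes=100, horizon=50):
    # Count how often each observed state variable changes under random actions.
    changes = defaultdict(int)
    steps = 0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(horizon):
            next_state = env.step(random.choice(actions))
            for i, (old, new) in enumerate(zip(state, next_state)):
                if old != new:
                    changes[i] += 1
            state = next_state
            steps += 1
    # Slowly changing variables are natural candidates for higher hierarchy levels.
    return {i: count / steps for i, count in changes.items()}
```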
References
[1] L. Kaelbling, M. Littman and A. Moore, "Reinforcement Learning:
A Survey", Journal of Artificial Intelligence Research, 4, pp 237--285,
1996.
[2] B. Hengst, "Discovering Hierarchy in Reinforcement Learning with
HEXQ", in Proceedings of the Nineteenth International Conference on
Machine Learning (ICML'02), pp 243--250, 2002.
Dr Pascal Vuylsteker's projects
Pascal Vuylsteker's honours projects.
Learning to Pick the Best SAT Solver
Supervisor: Dr Anbulagan and Dr Kee Siong Ng
E-Mail: anbulagan@nicta.com.au
E-Mail: kee.siong@rsise.anu.edu.au
Background
The satisfiability (SAT) problem is one of the most fundamental problems in computer science, with wide-ranging applications in automated reasoning, integrated circuit design, machine vision, planning, scheduling and more. It is an instance of the more general constraint satisfaction problem (CSP), which involves finding instantiations of variables subject to constraints on acceptable values. A SAT problem is a CSP in which all the variables are boolean-valued. Over the years, many efforts have gone into developing efficient solvers for different classes of SAT problems. Understandably, there is no one best solver for every SAT problem; each solver has its own strengths and weaknesses.
The aim of this project is not to develop yet another specialised SAT solver, but to find a way to build on existing work and design a meta algorithm on top of existing specialised solvers that can decide, for any given SAT problem, which solver to put to work. Machine learning is a technology that can be brought to bear on this problem. We have a lot of data on the relative performance of existing solvers on different classes of SAT problems. The problem then comes down to whether we can learn a function to predict the best solver given information on the structure of a given SAT problem. This is the problem the student is expected to solve, in collaboration with the researchers involved.
We have two very competitive baseline SAT solvers on which to build. R+AdaptNovelty+ won a gold medal in the Random SAT category at the 2005 International SAT Competition, and Dew_Satz won two bronze medals in the Random UNSAT and SAT+UNSAT categories in the same competition. Both solvers were developed locally by Anbulagan. This project would suit a student with good programming skills and an interest in artificial intelligence and machine learning.
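As a rough sketch of the meta-algorithm idea, one could learn to predict the fastest solver from simple structural features of a CNF instance; the feature set, the choice of classifier (a scikit-learn decision tree) and the training-data format below are all assumptions made for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

def features(clauses, num_vars):
    # Crude structural features of a CNF instance (clauses = lists of literals).
    num_clauses = len(clauses)
    avg_len = sum(len(c) for c in clauses) / max(num_clauses, 1)
    return [num_vars, num_clauses, num_clauses / max(num_vars, 1), avg_len]

def train_selector(training_set):
    # training_set: list of (clauses, num_vars, best_solver_name) from past runs
    X = [features(c, n) for c, n, _ in training_set]
    y = [best for _, _, best in training_set]
    return DecisionTreeClassifier().fit(X, y)

def pick_solver(model, clauses, num_vars):
    return model.predict([features(clauses, num_vars)])[0]
```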
Data Cleaning Via Logic
Supervisor: A/Prof Rajeev Gore
E-Mail: rajeev.gore@anu.edu.au
Background
The Australian Bureau of Statistics stores its census information as a collection of tables with one row (record) for every individual with many columns like age, sex, marital status, current occupation etc. Such tables invariably contain errors, either introduced during data collection, or during data entry. For example, the census data may contain a record for a person named X who is aged 6, is married and in high school.
Such data usually has an associated collection of extra conditions which capture important inter-relationships between the columns of the table. For example, the census table may have constraints like "the legal age of marriage is 16" and "a high-school student is aged between 11 and 18". According to these constraints, the entry for X is incorrect since the entry breaks both conditions, so the obvious question is to find a correction which is as close to the original as possible.
The problem of finding an error in a data table is known as error-localisation, and the problem of changing an incorrect record so that it passes all the constraints is known as error-correction. The standard method for solving both of these problems is called the Fellegi-Holt method, but it is highly inefficient. Recently, Agnes Boskovitz, a PhD student at RSISE, has shown a deep connection between the Fellegi-Holt method and a well-known method from propositional logic called resolution.
Efficient automated tools for performing resolution and associated logical deductions are readily available. The project is to implement the Fellegi-Holt method using these logical tools to gauge the efficacy of this logical approach to data cleaning. If successful, the prototype has huge potential in real world applications and will almost certainly lead to a publication in an international conference or journal.
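As a toy illustration of error-localisation (this is not the Fellegi-Holt method itself), the sketch below checks a record against the constraints from the census example and reports which fields a failed constraint implicates as candidates for correction:

```python
# Each constraint names the fields it involves and a test over the record.
constraints = [
    ({"age", "marital_status"},
     lambda r: not (r["marital_status"] == "married" and r["age"] < 16)),
    ({"age", "occupation"},
     lambda r: not (r["occupation"] == "high school student"
                    and not 11 <= r["age"] <= 18)),
]

def implicated_fields(record):
    fields = set()
    for involved, test in constraints:
        if not test(record):        # a violated constraint implicates its fields
            fields |= involved
    return fields

record = {"age": 6, "marital_status": "married", "occupation": "high school student"}
print(implicated_fields(record))    # {'age', 'marital_status', 'occupation'}
```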
Feel free to email me if you are at all interested in this project.
Interval Arithmetic for Computational Science
Supervisor: Alistair Rendell
E-Mail: Alistair.Rendell@anu.edu.au
An interval is a continuum of real numbers bounded by its end-points. Interval arithmetic performs operations on intervals rather than on discrete floating point representations. As a consequence, errors associated with numerical rounding are automatically propagated and accounted for. Moreover, the mathematical properties of intervals can be exploited to develop entirely new algorithms.
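A minimal sketch of the idea is shown below; note that a real implementation, such as the hardware/compiler support mentioned next, would also round the endpoints outward at each operation, which this illustration ignores.

```python
class Interval:
    # Toy interval type: endpoint arithmetic without directed (outward) rounding.
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(0.1, 0.2)
y = Interval(-1.0, 3.0)
print(x + y, x * y)   # the true real-valued result lies inside each interval
```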
The concept of intervals is actually not new, but was considered by Moore in 1965. What is relatively new is the fact that some hardware vendors are now providing direct support for interval arithmetic within their compilers/hardware.
In this project we wish to develop an interval code for evaluating the type of integrals that are used in quantum calculations on atoms and molecules. Evaluation of these integrals involves truncation of an infinite series, followed by the application of a variety of recursion relations. As a consequence, the values of the final integrals reflect both a series truncation error and a rounding error, which can vary according to the recursive path taken. This project will study the balance between these two sources of error and the implications for large-scale quantum calculations.
This project will build on work undertaken in 2005 by honours student Josh Milthorpe who studied the basic performance of interval operations and use of intervals in fast multipole calculations. You will also be part of a team working on interval related methods funded from an Australian Research Council Discovery grant.
Automatic Differentiation
Supervisor: Alistair Rendell
E-Mail: Alistair.Rendell@anu.edu.au
Every function, no matter how complicated, is executed on a computer as a series of elementary operations such as additions, multiplications, divisions and elementary functions like exp(). By repeated application of the chain rule to those elementary operations it is possible to obtain the derivative of any computed function in a completely mechanical fashion. This technique is applicable to computer programs of arbitrary length containing loops, branches and procedures. In contrast to explicit evaluation and coding of a derivative, automatic differentiation enables derivatives to be easily updated when the original code is changed.
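A minimal forward-mode sketch using dual numbers illustrates this mechanical application of the chain rule; it is only one of several automatic differentiation approaches, and the operations covered here are just addition, multiplication and exp().

```python
import math

class Dual:
    # Each value carries its derivative; the chain rule is applied per operation.
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def exp(x):
    return Dual(math.exp(x.value), math.exp(x.value) * x.deriv)

# d/dx of x * exp(x) at x = 2, obtained mechanically rather than by hand
x = Dual(2.0, 1.0)
f = x * exp(x)
print(f.value, f.deriv)   # the derivative equals exp(2) + 2*exp(2)
```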
A number of programs are now available to perform automatic differentiation for code written in a variety of languages. The aim of this project would be to use one of these tools to perform derivative evaluations for some computational science applications. Specifically, in the modelling of systems at the atomic level, derivatives with respect to atomic positions are hugely important, determining things like the frequencies at which molecules vibrate. Colleagues and I have spent large quantities of time and effort deriving expressions for derivatives and then coding them; automating this process, and making it relatively efficient, would be very significant. Your objective would be to explore this issue for some relatively simple computational molecular science codes.
Numerical Simulation on Graphics Processing Units (GPU)
Supervisor: Dr Alistair Rendell
E-Mail: Alistair.Rendell@anu.edu.au
Background
Over the past 10 years graphics processing units (GPUs) have become ubiquitous in desktop computers. Modern GPUs now allow substantial user programmability, which lets them be used for more general-purpose computation. Indeed, in a recent paper presented at SC04, Fan et al. detail their efforts to construct a "GPU Cluster for High Performance Computing". In this work the authors added 32 GPUs to a Xeon cluster system and in so doing increased the total theoretical peak performance of the machine by 512 Gflops - for a cost of just under US$13,000!
Tracking this development, a number of groups have been developing programming environments that can exploit the GPU for general computations. Details of many of these are available at the GPGPU web site. Two of particular interest to this project are Brook and Sh. The former is built on top of C while the latter is built on top of C++. Brook, for example, extends standard ANSI C with the concept of a "stream", a new data type representing data that can be operated on in parallel.
The aim of this project is to investigate the potential for using the GPU in molecular science computations. The initial target application is likely to be N-body simulations, as typified by the simulation of atom motions. Careful consideration will be given to accuracy, since the GPU is currently limited to single-precision computations. Thus the project will address two issues: i) the in-principle feasibility of using the GPU for these types of problems, and ii) the scientific usefulness given the current single-precision limitation.
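For orientation, a CPU sketch of the all-pairs N-body force computation is shown below. Each body's accumulation is independent of the others, which is exactly the data parallelism a stream language such as Brook would map onto the GPU; the float32 arrays loosely mimic the single-precision limitation mentioned above. The softening constant and problem size are arbitrary.

```python
import numpy as np

def accelerations(positions, masses, softening=1e-3):
    # Pairwise displacement vectors r_j - r_i, shape (N, N, 3)
    diff = positions[np.newaxis, :, :] - positions[:, np.newaxis, :]
    dist2 = (diff ** 2).sum(axis=-1) + softening ** 2
    inv_dist3 = dist2 ** -1.5
    np.fill_diagonal(inv_dist3, 0.0)          # no self-interaction
    weights = masses[np.newaxis, :, np.newaxis] * inv_dist3[:, :, np.newaxis]
    return (diff * weights).sum(axis=1)       # acceleration on each body

n = 256
positions = np.random.rand(n, 3).astype(np.float32)
masses = np.ones(n, dtype=np.float32)
print(accelerations(positions, masses).shape)  # (256, 3)
```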
InfiniBand RDMA Remote Memory Swapping
Supervisors: Richard Alexander, Alistair Rendell, Peter Strazdins
E-Mail: Richard@AlexanderTechnology.com, Alistair.Rendell@anu.edu.au, Peter.Strazdins@cs.anu.edu.au
Modern RDMA capable networks such as InfiniBand provide low latency of a few microseconds and high bandwidth up to 10Gbps. This has significantly reduced the gap between access to local and remote memory in modern clusters. A remote paging system has been developed by DK Panda et al to exploit the RDMA capabilities of InfiniBand, using a High Performance Networking Block Device (HPBD) which serves as a swap device for Linux kernel virtual memory for efficient page transfer to and from remote servers. This HPBD has been implemented for the Linux kernel 2.4 as a kernel module. This project will implement the RDMA HPBD as a kernel module for Linux kernel 2.6, and investigate ways to improve the performance of the RDMA HPBD to reduce copy cost and utilize the zero-copy feature of RDMA operations. Alexander Technology as part of the Jabberwocky project will provide hardware for this project.
References:
- Shuang Liang, Ranjit Noronha, DK Panda, Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device, Proceedings of the 2005 IEEE International Conference on Cluster Computing.
- Jiuxing Liu, Jiesheng Wu, DK Panda, High Performance RDMA-Based MPI Implementation over InfiniBand. Proceedings of The 17th Annual International Conference on Supercomputing (ICS'03).
- InfiniBand Trade Association home page
MPI and Global File System I/O over InfiniBand using Virtual Lanes
Supervisors: Richard Alexander,
Alistair Rendell,
Peter Strazdins
E-Mail:
Richard@AlexanderTechnology.com,
Alistair.Rendell@anu.edu.au,
Peter.Strazdins@cs.anu.edu.au
InfiniBand networks provide for multiple, parallel system area networks over the same physical hardware, via a concept called Virtual Lanes. Virtual Lanes allow the network to be split for a variety of purposes, including the separation of HPC cluster management, message passing and I/O traffic. There are a number of parallel filesystems used in heterogeneous Linux clusters, with the most widely used being PVFS2, GPFS and Lustre, in addition to the legacy NFS system. One possibility with large SMP nodes in a cluster is to use InfiniBand virtual lanes and operating system virtual machines in a way that allows nodes of a cluster to be used as both compute nodes and filesystem storage nodes. This project will implement the Lustre filesystem on a small test cluster with large SMP nodes and the InfiniBand interconnect, using virtual machines in a novel way to segregate compute functions from I/O.
References:
- J Liu, A Vishnu, DK Panda, Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation. Proceedings Supercomputing 2004.
- J Cope, M Oberg, H Tufo and M Woitaszek, Linux Multi-Cluster Environments.
Declarative theorem-proving
Supervisor: Dr Michael Norrish
Email:
michael.norrish@nicta.com.au
This project would suit a student keen to work in an area involving logic and proof, and the development of infrastructure to support these activities.
When specifying systems formally, a significant error-finding benefit can accrue simply from writing everything down in an expressive logic, and getting it all to type-check. Working in this way allows users to quickly deal with the “low-hanging fruit”; important bugs that are easy to find.
The HOL system currently doesn’t allow for this model of work very well. Part of the reason for this is that script files for HOL inter-mingle SML (which is the theorem-prover’s implementation language) and embedded bits of logic. We would like to check the logic without having to be able to parse (and execute) SML as well.
Therefore, the first aim of this project is to design a simple file format in which users could write “SML-less HOL”, and to then implement at least two tools for manipulating this format. The first tool would be the type-checker. It would make sure that the definitions and other logical material in the file actually made sense at the level of types. The second tool would do all this and also produce a genuine HOL theory, using the various definitional facilities already present in HOL. This file format would need to include mechanisms for referencing other files, where previous logical developments had been built up.
The final stage of the project would be to implement a simple declarative language for writing proofs. The issues here are similar to those for definitions; to escape the use of files that include (SML) code, and move to formats that are easy to parse and manipulate as data.
Finding Models of Unsolvable First Order Goals
Supervisor: Dr Michael Norrish
Email:
michael.norrish@nicta.com.au
When using an interactive proof assistant, a powerful tactic for finishing goals is the use of first order proof procedures. Unfortunately, these procedures are all-or-nothing affairs: if they succeed, they succeed completely, but if they fail to prove the goal, the user knows only that the provided lemmas and assumptions were not sufficient to allow the procedure to find a proof. The advent of model generation tools promises a way out of this painful situation. These tools can be used to generate what is effectively a counter-example to false goals. This is much more helpful for the user. Either the tool's output really is a counter-example and their proof will never succeed, or the counter-example will be impossible because the user has forgotten to tell the tools some relevant lemma from a database of known facts. The student's task in this project would be to implement a connection between two existing tools: the interactive system HOL4, and the model generation tool Darwin. This connection would provide a user-friendly implementation of the scheme described above: when an invocation of the first order prover in HOL failed, the connection implementation would translate the HOL goal into terms that Darwin could understand, call Darwin, and then finally translate Darwin's provided counter-example back into a human-readable form for the user's benefit. This project would suit a student familiar with functional programming (HOL is written in SML), and happy with the concepts of first order logic (terms, formulas, quantifiers). The student would be jointly supervised by Michael Norrish and Peter Baumgartner.
Optimizing for Design Patterns
Supervisor: Dr Steve Blackburn and Dr Alonso Marquez
Email: Steve.Blackburn@anu.edu.au and
alonsomarquezes@yahoo.es
Background
The software engineering benefits of design patterns are widely understood. Consequently their use has become ubiquitous in industry. This widespread uptake is in spite of many patterns incurring significant performance penalties. This phenomenon is only just emerging and has so far not received attention from the programming languages community.
Previously we have shown that bytecode transformations at class loading time can be used to make significant performance optimizations for Java programs. Our approach is powerful and unorthodox, sidestepping heavyweight pointer analyses typically used by the mainstream compiler community. This project will leverage this approach to optimize for some heavily used design patterns.
We anticipate that this work could lead to publication.
Performance profiling for Java
Supervisor: Dr Steve Blackburn and Daniel Frampton
Email: Steve.Blackburn@anu.edu.au and Daniel.Frampton@anu.edu.au
Background
A popular profiling technique is to use sampling. Periodically the program counter is sampled, and over the run of a program (involving many thousands of samples) a histogram is produced. The program counter values are mapped back to source lines of code to identify which source code is most heavily executed. This is often very helpful in debugging performance problems. Refinements to this idea include walking the stack at each sample point to enrich the context. Another refinement is to use hardware performance counters, which can be set to trigger interrupts after a given number of events (L2 cache misses, for example). This can be used to narrow performance problems down more specifically, by identifying exactly which code triggers them.
Our group is leading work in Java performance and optimization and we are responsible for key research infrastructure to this end. One crucial tool we lack is the capacity to do performance counter driven profiling. This project would build a performance counter based profiler for Jikes RVM (the widely used research virtual machine), and would analyze the performance of some key Java programs which are currently not well understood.
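As a toy illustration of the sampling idea (not the Jikes RVM tool itself), the sketch below uses a Unix interval timer to build a line-level histogram of where a Python program spends its time; a hardware-counter-driven profiler would apply the same histogram idea at the machine-code level. The sample period and workload are arbitrary, and the sketch is Unix-only.

```python
import collections
import signal

samples = collections.Counter()

def record_sample(signum, frame):
    # Record where the "program counter" (here, the current source line) was.
    samples[(frame.f_code.co_filename, frame.f_lineno)] += 1

signal.signal(signal.SIGPROF, record_sample)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)   # sample every 10 ms of CPU time

total = 0
for i in range(5_000_000):          # the workload being profiled
    total += i * i

signal.setitimer(signal.ITIMER_PROF, 0, 0)          # stop sampling
for (filename, line), count in samples.most_common(5):
    print(f"{filename}:{line}  {count} samples")
```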
Sane Garbage
Supervisor: Dr Steve Blackburn and Daniel Frampton
Email: Steve.Blackburn@anu.edu.au and
Daniel.Frampton@anu.edu.au
Background
Garbage collection is key to modern languages like Java and C#. Our group plays a leading role in the development of garbage collection algorithms and we maintain the leading infrastructure for memory management research, MMTk. Garbage collectors are notoriously difficult to debug, as they subvert most assumptions about program behavior (by manipulating the object graph behind the back of the application). There are very few useful tools for debugging garbage collectors---typically we must resort to primitive methods such as the careful use of print statements!
This project will develop a heap sanity checker. The heap sanity checker will be completely unobtrusive, only observing, not altering the state of the system. The sanity checker will always be present in the system, though inactive unless invoked at the command line. The checker would report on dangling pointers, uncollected garbage, etc. The tool would also be extensible, allowing the addition of algorithm-specific checks (such as reference count checks for a reference counting collector). This tool would be invaluable to garbage collection research.
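As a toy model of the checks involved (the real tool would walk MMTk's heap in place without modifying it), the sketch below represents the heap as a map from object identifiers to the identifiers they reference, and reports dangling pointers and uncollected garbage:

```python
def check_heap(heap, roots, live_after_gc):
    # Dangling pointers: references to objects that are no longer in the heap.
    for obj, refs in heap.items():
        for ref in refs:
            if ref not in heap:
                print("dangling pointer:", obj, "->", ref)
    # Uncollected garbage: live objects that are unreachable from the roots.
    reachable, stack = set(), list(roots)
    while stack:
        obj = stack.pop()
        if obj not in reachable:
            reachable.add(obj)
            stack.extend(r for r in heap.get(obj, ()) if r in heap)
    for obj in sorted(live_after_gc - reachable):
        print("uncollected garbage:", obj)

heap = {"a": ["b"], "b": ["c"], "c": [], "d": ["x"]}   # "x" has been freed
check_heap(heap, roots={"a"}, live_after_gc=set(heap))
```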
An allocator for machine code
Supervisor: Dr Steve Blackburn and Daniel Frampton
Email: Steve.Blackburn@anu.edu.au and
Daniel.Frampton@anu.edu.au
Background
Modern virtual machines use Just In Time (JIT) compilers to dynamically generate machine code. The placement of the code in memory can have significant performance impacts. For example, if two contemporaneously hot methods are located in memory in such a way as to conflict within the instruction cache, then performance will suffer. Furthermore, the needs of code allocation are somewhat different to conventional allocation, because code objects tend to be larger than normal objects and their size distribution is much less homogeneous. Finally, code allocation is rare, so a slightly more expensive allocation mechanism that is more careful with space may be far preferable. Currently in Jikes RVM code is allocated using a conventional allocator. This project would develop a new allocator that would allow the user a higher degree of flexibility in the placement of code (possibly specifying cache lines to avoid, or cache lines to favor), and would be better tailored to the space and time requirements of code allocation.
Java to C++ Translation
Supervisor: Dr Steve Blackburn and Robin Garner
Email: Steve.Blackburn@anu.edu.au and
Robin.Garner@crsrehab.gov.au
Background
MMTk is a memory management toolkit that is an important tool for memory management research. MMTk is written in a special dialect of Java which avoids dynamic memory allocation and the use of Java libraries that might perform such allocation, and which adds to Java unboxed types for memory primitives such as Address, Word, and ObjectReference. The result is a powerful, flexible and high-performing toolkit which allows researchers to rapidly experiment with garbage collection ideas. MMTk was implemented within Jikes RVM, a Java virtual machine written in Java. Since then MMTk has been ported to another virtual machine written in C. Rather than port the source code from Java to C, it is desirable to somehow produce a C library usable by the C program. In the first instance, this was done using the GNU compiler for Java, gcj. We would like a more general solution. This project will write a Java to C++ translator for MMTk. The limited subset of Java used by MMTk makes the project feasible. We envisage the automatic translator generating both C++ source files and headers with inline functions, allowing key elements of the interface to be inlined into the host virtual machine, greatly improving performance.
The Active National University: applying location awareness
Supervisor: A/Prof Chris Johnson
E-Mail: cwj@cs.anu.edu.au
Background
The proliferation of wireless access points and mobile access devices across university campuses provides an opportunity to use information on people's locations, provided by their wireless computing devices, to support their working and recreational life on campus. The Active Campus project at the University of California, San Diego is an example of a prototype system. In any such system there are technical issues about the effectiveness of the wireless network as a location sensor and information carrier, issues of who provides and administers access, identification of devices and people, the design of applications and their interfaces on mobile devices, privacy and ethics in tracking and informing on people, and the effectiveness of systems as people use them and adapt to their use. Wireless location on this scale has low precision and is highly variable. The ANU is deploying a network of 802.11b (WiFi) wireless access points; it's time to turn the campus into the Active National University.
Software Engineering honours student research project
Several projects in this area will potentially support each other:
1. Support framework for applications: (software engineering biased) Analyse, design and implement a robust, extendable framework that demonstrates support for a small suite of basic applications suited to the ANU campus. Mobile computing platforms can be those anticipated in the near future, and the applications can be selected or invented to support education, research, and social activities. The system must incorporate a strong, flexible emphasis on privacy and security.
2. Location Authority and Service: Various structures and requirements for Location Authorities for location-based services have been suggested, in particular by Shafer. A robust, long-life, easily maintained Location Authority is essentially a database that maps the device identity of wireless access points onto their geographic locations on campus, but it also provides procedures and utilities for creating and maintaining the actual data as access points are installed, moved, go down and return to service. The LA should interoperate with Geographic Information Systems, using the coordinate systems provided by available maps of the ANU campus as a demonstration example. (A minimal sketch appears after this list.)
3. Novel applications: Design, implement and evaluate the effectiveness and social issues of novel location-aware applications on the Active National University framework system. Human and social factors are as important here as communications and information services.
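As a toy sketch of the Location Authority in project 2 (the access-point identifiers and coordinates are invented), a minimal table from access-point identifiers to campus coordinates, with a crude signal-weighted centroid estimate of a device's position:

```python
access_points = {
    "00:11:22:33:44:01": (100.0, 250.0),   # hypothetical AP coordinates (metres)
    "00:11:22:33:44:02": (120.0, 255.0),
    "00:11:22:33:44:03": (300.0, 400.0),
}

def register(bssid, coordinates):
    # Maintenance as access points are installed, moved, or replaced.
    access_points[bssid] = coordinates

def estimate_location(visible):
    # visible: list of (bssid, signal_strength); stronger signals weigh more.
    known = [(access_points[b], s) for b, s in visible if b in access_points]
    if not known:
        return None
    total = sum(s for _, s in known)
    x = sum(p[0] * s for p, s in known) / total
    y = sum(p[1] * s for p, s in known) / total
    return (x, y)

print(estimate_location([("00:11:22:33:44:01", 0.7), ("00:11:22:33:44:02", 0.3)]))
```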
More information
- Guiding, Tracking and perceptual computing: http://www.designing-ubicomp.com/Arbeitsdateien/7_tracking+guiding.html
- Ubiquitous Computing (in Daily Wireless): http://www.dailywireless.org/modules.php?name=News&file=article&sid=1631
- ActiveCampus: Experiments in Community-Oriented Ubiquitous Computing, William G. Griswold, Patricia Shanahan, Steven W. Brown, Robert Boyer, Matt Ratto, R. Benjamin Shapiro, Tan Minh Truong, IEEE Computer, October 2004 pp. 73-81
- The Carrot Approach: Encouraging Use of Location Systems, by Kieran Mansley, Alastair R. Beresford, David Scott, in N. Davies et al. (eds), UbiComp 2004, LNCS 3205, pp 366-383, 2004.
- Everyday Encounters with Context-Aware Computing in a Campus Environment, Louise Barkhuus and Paul Dourish, in N. Davies et al. (eds), UbiComp 2004, LNCS 3205, pp 232-249, 2004
- Steven A. N. Shafer, Location Authorities for Ubiquitous Computing, in Mike Hazas, James Scott, and John Krumm (eds), UbiComp workshop on Location-Aware Computing, 2003. at http://www.ubicomp.org/ubicomp2003/workshops/locationaware
