The following is a list of project proposals by Dr Peter Strazdins that are suitable for coursework projects (3rd year BIT, 4th year BSEng, Honours, Masters), valid for semester 1/2 2008 :
Evaluation of Multicore Programming Paradigms for Multicore Computing
(research):
There is a plethora of programming paradigms available for multicore
computing, ranging from library-based (Posix threads), annotation-based
(e.g.\ OpenMP, Clik), object-based (e.g. Intel TBB), semi-implicit
parallel (Fortress, X10, Chapel) and functional (e.g. Erlang). Some have
emerged over the last few years and are still developing.
Considerations include succinctness and ease-of-programming, (potential
for) performance and scalability, dealing with locality and
heterogeneity.
This project will review existing literature on these paradigms with
particular attention being paid to the multicore targets. It will also
develop a series of small programming case studies on the T2 (and
possibly the OpenPower 720) in order to evaluate these paradigms. Not
only is this project of high strategic importance to the teaching and
research programs in DCS, it is potentially an immensely valuable
educational experience!
Basic Linear Algebra kernels for the T2: (research/implemnation):
The Niagara or CoolThreads (T) series of processors has been motivated by
commercial applications such as web, database and eCommerce servers. Thus,
they have been able to achieve high throughput on workloads with many execution
threads, which are integer, memory and I/O intensive.
However, it has come equipped with a floating point unit on each core.
The question arises: how effective is it for (parallel) numerical
applications, and how can it be best programmed for these?
This project will investigate how to program the matrix-matrix multiply
kernel from the BLAS library, dgemm(), for the T2, optimizing
it for both single core and multicore/thread performance. It will also
evaluate its performance. Since dgemm() forms the computational
heart of almost all dense linear algebra computations, it is suprising
that there seems to be very little work done on this topic so far.
Evaluation of the T2's Multithreading Performance: (research):
The T2 not only has 8 computational cores, but each core can
concurrently support 8 computational threads. The question arises: how
much speedup can you get from the multiple cores, and how much more can
you get out of the hardware threading? Furthermore, what is the best
way to synchronize software threads, running across different cores and
hardware threads? In what ways is such a processor different from a
traditional shared memory processor?
This project will build on the a previous project by Hugh
Blemings based on the OpenPower 720. It involves evaluating the T2
on memory- and compute- intensive parallel benchmarks and
synchronization algorithms.
Evaluation of Virtualization Capabilities - LDoms (research):
(this is a variant, or possibly an add-on, for the above topic)
In order to realize its immense potential of power saving in the server
context, the T2 supports virtualization, in which a single T2
physical machine can simultaneously support many virtual server machines
for various clients. The mechanism for virtualization in the T2 is
called Logical Domains. However, virtualization typically
involves a computational cost.
This project will first undertake a review of the LDom virtualization
mechanism, and then investigate the effect of running various
benchmarks and applications under virtualizated and native domains. This
project is of immediate interest to the DCS teaching program, as for
security, and ease of managemnt considerations, it is desirable to
export separate virtual computers for teaching a research puproses,
provied the overheads for doing so are at acceptable (or at least
understood).
There is a publically available open-source code for a highly optimized benchmark called High Performance Linpack (HPL). It is also highly tunable, but the tuning currently must be done by hand. This is problematic, in the sense that it requires both knowledge if the complex algorithms used in HPL (beyond the understanding of most people who wish to use the benchmarks) and of the (cluster) supercomputer itself. Furthermore, in the case of clusters, this problem is exacerbated by the fact that the cluster may be non-homogeneous (so the benchmark results may depend on which parts of the cluster are selected), and by the fact that by their nature clusters may be easily reconfigured. Thus, it is highly desirable that this tuning process can be automated.
The LAPACK For Clusters (LFC) project went partway in addressing this, in that some work on determining, given p processors, to the best logical P x Q logical processor configuration, where P*Q <= p was made. However, automatically choosing the critical blocking parameter and a plethora of algorithm variants remains to be done.
This project will investigate automatic tuning methods for the HPL package for cluster computers. The search can be done statically, and employ efficient empirical methods based on techniques such as the Nelder-Mead simplex search, as is done for the ATLAS package (which itself can be used as a component of HPL).
The implementation version of the project will begin by providing an implementation using simple empirical methods with an evaluation. The research version would investigate a variety of more sophisticated methods, and would evaluate their suitability in this context. This, and fact that the project spans the AI and CSys areas, makes this project suitable for the CS Group Project. The DCS Saratoga cluster will form a suitable experimental platform. The project has the potential for publishable work and for the production of widely used open-source software.
A Beowulf-style cluster computer is a parallel computer using Commercial-off-the-Shelf switch-based network to communicate between the processors. The ANU Beowulf cluster Bunyip is such a cluster based on Fast (100Mb) Ethernet switches. Clusters have proved a highly cost-effective high performance computer model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks.
A typical cluster as Bunyip tends to evolve, as new processors and networks (and the monies to pay for them!) become available. Thus clusters tend to become heterogeneous. Furthermore, for a particular application, there tends to be a particular cluster configurations that is optimal. For the case of a (high performance) computing service center, design to serve many such applications, a cluster may be designed to be heterogeneous from the beginning.
The efficient parallelization of applications using the Message Passing Interface (MPI) is reasonably well understood for homogeneous clusters. However, for heterogeneous clusters, which can not only have nodes with different processor architectures but also sections of nodes connected by different networks, the situation is still far more challenging. In particular, differences in processors and networks speeds make it difficult to utilize the faster components more effectively.
Using the Jabberwocky heterogeneous cluster, this project will investigate techniques used so far to parallelize (scientific/engineering) applications for heterogeneous clusters. The project will then make a selection of these applications and evaluate the effectiveness of these techniques on the Jabberwocky. New or variant techniques may then be proposed and evaluated.
Recently, IBM / Sony / Toshiba developed the Cell BroadBand Engine, an early version of which was incorporated in the Sony Playstation 3 (PS/3). This processor, with its unique design, has an enormous potential for floating-point performance (single precision). But the programmability of this architecture remains an issue.
This project will begin by investigating prior work in the area. Implementation will begin by implementing both `right-looking' (e.g. LAPACK's sgetrf()) and left-looking versions of LU factorization algorithms, linked with IBM's SDK BLAS routines, which can make use of the CBE as a `function accelerator'. The performance of variants using different blocking factors, and determining which parts need to be offloaded from the main processor can be undertaken. Time permitting, programming the Cell's internal processing elements more directly may be investigated.
The implementation version of this project will involve developing the codes and a test harness to perform these evaluations. The research version will involve a deeper analysis into the performance, and look at more advanced methods.
The simplest method for such applications to run a separate MPI process on each CPU in the system, but the efficiency of this depends on whether the MPI implementation is sufficiently intelligent to make use of the fact that there will be large sub-groups of processes that are running on the same node and hence can communicate by much faster communication mechanisms.
This project will investigate the efficiency of the highly sophisticated OpenMPI implementation of MPI on large-scale shared memory nodes, such as the Freemont (and 8 x 2 CPU Opteron) and IBM OpenPower 720 (2 x 2 CPU x 2-way multithreading) systems hosted in DCS. It will begin by an investigation of the relevant algorithms used in OpenMPI, concentrating on barriers and collective operations, comparing them (where applicable) with corresponding operations using threaded programming paradigms.
The implementation version of the project will involve constructing relevant benchmarks and a test harness to perform the experiments. An evaluation of the current implementation of OpenMPI in this context will then be given. The research version of the project will also involve investigating alternate methods (from the literature), and evaluating these as well.
Simulation is playing an increasingly important role in the design and evaluation of current and future high performance computer systems. Sparc-Sulima is an open-source UltraSPARC SMP simulator, written in C++ and developed by the CC-NUMA Project. Sparc-Sulima simulates an UltraSPARC processor by interpreting each instruction. It has timing models which predict an UltraSPARC III Cu shared memory processor performance to a high degree of accuracy.
Currently, Sparc-Sulima can simulate an unmodified Solaris (UltraSPARC) binary (executable program). When that program executes a system call trap instruction, the corresponding operating system effects are `emulated' (the simulated computer state is updated directly, rather than via simulating a lengthy and complex sequence of kernel-level instructions), and simulated execution then resumes. This mechanism is handled by the SolEmn emulation layer in Sparc-Sulima. Threaded programs (e.g. using the pthreads Posix thread libraries) ultimately access thread-related operating system services via the Solaris _lwp (lightweight process) trap instructions. The limitation of the current implementation is that, at any time, there can only be one active thread per CPU. Such a limitation has been in all emulation-based computer simulators in the literature also.
The aim of this project is to extend the emulation of the _lwp trap instructions to remove this limitation. It will require implementing thread switching, which can occur by time-outs, and also by the _lwp_suspend and yield system calls. The current thread context save/restore mechanisms will also have to be extended. If successful, the work should be publishable (along with the existing work on threads emulation).
An extension of this project, more implementation oriented, would be to simulate, rather then emulate the _lwp traps. This would involve writing a mini-kernel that would perform an implementation of lightweight threads. Now, when Sparc-Sulima executes a threaded program and encounters an _lwp trap, it will starts executing the kernel, just like on a real machine. Note that the same underlying algorithms of the emulation mode can be used; however, some considerations must now be given to the `locking' of shared data structures. The advantage of this is that kernel-level (notably memory system-related) overheads can now be recorded by the simulator. Thus, the simulator will be able to predict more accurately the threaded program's overheads.
An extension of this project, more research-oriented, would be to investigate which thread scheduling heuristics best approximate real machine behavior.
This project is related to the project Porting a SPARC-Solaris simulator to x86-Linux. It is part of the CC-NUMA Project. It has the potential for publishable work.
Currently, Sparc-Sulima can simulate an unmodified Solaris (UltraSPARC) binary (executable program). When the program executes a system call trap instruction, the corresponding operating system effects are `emulated' (the simulated computer state is updated directly, rather than via simulating a lengthy and complex sequence of kernel-level instructions), and simulated execution then resumes. This mechanism is handled by the SolEmn emulation layer in Sparc-Sulima. For example, threaded programs (e.g. using the pthreads Posix thread libraries) ultimately access thread-related operating system services via the Solaris _lwp (lightweight process) trap instructions.
A limitation of the current implementation of the SolEmn emulation mode is that it must be run on a Sparc-Solaris host. This limits greatly where the simulator can be used. Note however that simpler and more limited emulation modes have been ported to the x86-Linux; these modes can only run specially (and statically) linked binaries - and this excludes threaded programs!
The aim of this project is to `port' the Sparc-Sulima to the most important platform - x86/Linux! The majority of work will be in SolEmn: the emulation of Solaris system calls using Linux system calls. Since, there is not always a one-to-one relation, this is not trivial. The strategy would be to gradually extend the number of ported calls and test with statically-programs of increasing functionality. Dynamically-linked programs will be more challenging, requiring the mmap system call among others.
This project is related to the project Extended Threads Emulation in an SMP Computer Simulator. It is part of the CC-NUMA Project.
The `visible machine' simulator was written using the Tcl, Tk, and Tix libraries. It has recently been fixed to run on Linux, but a port to Windows has not yet been achieved. The main problem is that support for Tcl, Tk and Tix is poor. For the current students, this is inconvenient as they are not able to set up the software on their home machines (even if it is a Linux machine, as installation of the above libraries is non-trivial).
This project will firstly rewrite the GUI part of the `visible machine' simulator using more contemporary infrastructure, such as GTK or Java / Python. For COMP6703/8770/80, an HCI evaluation of the current tool should be performed, and the interface for the new machine will be redesigned accordingly (and re-evaluated. For COMP8790, the process aspects of the project will be emphasized.
Last Modified: Peter Strazdins, July 2008