Coursework Project Proposals by Peter Strazdins

News 11/07/08: our UltraSPARC T2 has arrived!

The following is a list of project proposals by Dr Peter Strazdins that are suitable for coursework projects (3rd year BIT, 4th year BSEng, Honours, Masters), valid for semester 1/2 2008 :

For 2009, more proposals are expected (e.g. in virtualization, grids and service-oriented architecture, and distributed shared memory).

Research/implementation project proposal:
Various Topics on the T2 multi-core processor

Description

Evaluation of Multicore Programming Paradigms for Multicore Computing (research):
There is a plethora of programming paradigms available for multicore computing, ranging from library-based (Posix threads), annotation-based (e.g.\ OpenMP, Clik), object-based (e.g. Intel TBB), semi-implicit parallel (Fortress, X10, Chapel) and functional (e.g. Erlang). Some have emerged over the last few years and are still developing. Considerations include succinctness and ease-of-programming, (potential for) performance and scalability, dealing with locality and heterogeneity.
This project will review existing literature on these paradigms with particular attention being paid to the multicore targets. It will also develop a series of small programming case studies on the T2 (and possibly the OpenPower 720) in order to evaluate these paradigms. Not only is this project of high strategic importance to the teaching and research programs in DCS, it is potentially an immensely valuable educational experience!

Basic Linear Algebra kernels for the T2: (research/implemnation):
The Niagara or CoolThreads (T) series of processors has been motivated by commercial applications such as web, database and eCommerce servers. Thus, they have been able to achieve high throughput on workloads with many execution threads, which are integer, memory and I/O intensive. However, it has come equipped with a floating point unit on each core. The question arises: how effective is it for (parallel) numerical applications, and how can it be best programmed for these?
This project will investigate how to program the matrix-matrix multiply kernel from the BLAS library, dgemm(), for the T2, optimizing it for both single core and multicore/thread performance. It will also evaluate its performance. Since dgemm() forms the computational heart of almost all dense linear algebra computations, it is suprising that there seems to be very little work done on this topic so far.

Evaluation of the T2's Multithreading Performance: (research):
The T2 not only has 8 computational cores, but each core can concurrently support 8 computational threads. The question arises: how much speedup can you get from the multiple cores, and how much more can you get out of the hardware threading? Furthermore, what is the best way to synchronize software threads, running across different cores and hardware threads? In what ways is such a processor different from a traditional shared memory processor?
This project will build on the a previous project by Hugh Blemings based on the OpenPower 720. It involves evaluating the T2 on memory- and compute- intensive parallel benchmarks and synchronization algorithms.

Evaluation of Virtualization Capabilities - LDoms (research):
(this is a variant, or possibly an add-on, for the above topic) In order to realize its immense potential of power saving in the server context, the T2 supports virtualization, in which a single T2 physical machine can simultaneously support many virtual server machines for various clients. The mechanism for virtualization in the T2 is called Logical Domains. However, virtualization typically involves a computational cost.
This project will first undertake a review of the LDom virtualization mechanism, and then investigate the effect of running various benchmarks and applications under virtualizated and native domains. This project is of immediate interest to the DCS teaching program, as for security, and ease of managemnt considerations, it is desirable to export separate virtual computers for teaching a research puproses, provied the overheads for doing so are at acceptable (or at least understood).

References

From the T2 wiki, you can get to most things on the T2. Try Google seaching on the relevant keywords to find more about programming languages etc mentioned above. For a more general refernce, see The Coming Wave of Multithreaded Chip Multiprocessors.

Research/implementation project proposal:
Automatic Tuning of the High Performance Linpack Benchmark

Description

Supercomputers today are large-scale parallel processors, and the Top 500 list is the most definitive ranking of the world's fastest. While the status of this list is contentious, the list is based on the results of a single benchmark - the Linpack benchmark - which in this case solves a (very) large dense linear system. The list is dominated by Beowulf-style cluster computers: a parallel computer using (typically Commercial-off-the-Shelf) switch-based network to communicate between the processors.

There is a publically available open-source code for a highly optimized benchmark called High Performance Linpack (HPL). It is also highly tunable, but the tuning currently must be done by hand. This is problematic, in the sense that it requires both knowledge if the complex algorithms used in HPL (beyond the understanding of most people who wish to use the benchmarks) and of the (cluster) supercomputer itself. Furthermore, in the case of clusters, this problem is exacerbated by the fact that the cluster may be non-homogeneous (so the benchmark results may depend on which parts of the cluster are selected), and by the fact that by their nature clusters may be easily reconfigured. Thus, it is highly desirable that this tuning process can be automated.

The LAPACK For Clusters (LFC) project went partway in addressing this, in that some work on determining, given p processors, to the best logical P x Q logical processor configuration, where P*Q <= p was made. However, automatically choosing the critical blocking parameter and a plethora of algorithm variants remains to be done.

This project will investigate automatic tuning methods for the HPL package for cluster computers. The search can be done statically, and employ efficient empirical methods based on techniques such as the Nelder-Mead simplex search, as is done for the ATLAS package (which itself can be used as a component of HPL).

The implementation version of the project will begin by providing an implementation using simple empirical methods with an evaluation. The research version would investigate a variety of more sophisticated methods, and would evaluate their suitability in this context. This, and fact that the project spans the AI and CSys areas, makes this project suitable for the CS Group Project. The DCS Saratoga cluster will form a suitable experimental platform. The project has the potential for publishable work and for the production of widely used open-source software.

References

See the links above, and also:

Research project proposal:
Parallelizing Applications on a Heterogeneous Cluster

A Beowulf-style cluster computer is a parallel computer using Commercial-off-the-Shelf switch-based network to communicate between the processors. The ANU Beowulf cluster Bunyip is such a cluster based on Fast (100Mb) Ethernet switches. Clusters have proved a highly cost-effective high performance computer model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks.

A typical cluster as Bunyip tends to evolve, as new processors and networks (and the monies to pay for them!) become available. Thus clusters tend to become heterogeneous. Furthermore, for a particular application, there tends to be a particular cluster configurations that is optimal. For the case of a (high performance) computing service center, design to serve many such applications, a cluster may be designed to be heterogeneous from the beginning.

The efficient parallelization of applications using the Message Passing Interface (MPI) is reasonably well understood for homogeneous clusters. However, for heterogeneous clusters, which can not only have nodes with different processor architectures but also sections of nodes connected by different networks, the situation is still far more challenging. In particular, differences in processors and networks speeds make it difficult to utilize the faster components more effectively.

Using the Jabberwocky heterogeneous cluster, this project will investigate techniques used so far to parallelize (scientific/engineering) applications for heterogeneous clusters. The project will then make a selection of these applications and evaluate the effectiveness of these techniques on the Jabberwocky. New or variant techniques may then be proposed and evaluated.

References

See the links above, and also:

Research/Implementation project proposal:
LINPACK Benchmark Performance on the Cell BroadBand Engine (PS/3)

Description

The LINPACK Benchmark involves a computation which solves a dense linear system of equations. This is dominated by the LU factorization of the associated matrix. The benchmark is arguably (contentiously!) the most important single benchmark in the high-performance computing world, and is used Top 500 list of supercomputers.

Recently, IBM / Sony / Toshiba developed the Cell BroadBand Engine, an early version of which was incorporated in the Sony Playstation 3 (PS/3). This processor, with its unique design, has an enormous potential for floating-point performance (single precision). But the programmability of this architecture remains an issue.

This project will begin by investigating prior work in the area. Implementation will begin by implementing both `right-looking' (e.g. LAPACK's sgetrf()) and left-looking versions of LU factorization algorithms, linked with IBM's SDK BLAS routines, which can make use of the CBE as a `function accelerator'. The performance of variants using different blocking factors, and determining which parts need to be offloaded from the main processor can be undertaken. Time permitting, programming the Cell's internal processing elements more directly may be investigated.

The implementation version of this project will involve developing the codes and a test harness to perform these evaluations. The research version will involve a deeper analysis into the performance, and look at more advanced methods.

References

See the links above, and:

Research/Implementation project proposal:
Tuning the Message Passing Interface library for large-scale Shared Memory / Multicore Processors

Description

The Message Passing Interface (MPI) is the standard programming paradigm for distributed memory parallel computing, and many large and important applications have already been written in terms of MPI, and have run successfully on recent cluster computers. With the recent advent of large-scale chip multi-processors, the issue emerges of how can these applications can reap the potentially massive performance of the new generation of clusters that will be comprised of a number of processor chips connected by shared memory, where each chip is a chip multi-processor.

The simplest method for such applications to run a separate MPI process on each CPU in the system, but the efficiency of this depends on whether the MPI implementation is sufficiently intelligent to make use of the fact that there will be large sub-groups of processes that are running on the same node and hence can communicate by much faster communication mechanisms.

This project will investigate the efficiency of the highly sophisticated OpenMPI implementation of MPI on large-scale shared memory nodes, such as the Freemont (and 8 x 2 CPU Opteron) and IBM OpenPower 720 (2 x 2 CPU x 2-way multithreading) systems hosted in DCS. It will begin by an investigation of the relevant algorithms used in OpenMPI, concentrating on barriers and collective operations, comparing them (where applicable) with corresponding operations using threaded programming paradigms.

The implementation version of the project will involve constructing relevant benchmarks and a test harness to perform the experiments. An evaluation of the current implementation of OpenMPI in this context will then be given. The research version of the project will also involve investigating alternate methods (from the literature), and evaluating these as well.

References


Implementation/research project proposal:
Extended Threads Emulation in an SMP Computer Simulator

Description

Simulation is playing an increasingly important role in the design and evaluation of current and future high performance computer systems. Sparc-Sulima is an open-source UltraSPARC SMP simulator, written in C++ and developed by the CC-NUMA Project. Sparc-Sulima simulates an UltraSPARC processor by interpreting each instruction. It has timing models which predict an UltraSPARC III Cu shared memory processor performance to a high degree of accuracy.

Currently, Sparc-Sulima can simulate an unmodified Solaris (UltraSPARC) binary (executable program). When that program executes a system call trap instruction, the corresponding operating system effects are `emulated' (the simulated computer state is updated directly, rather than via simulating a lengthy and complex sequence of kernel-level instructions), and simulated execution then resumes. This mechanism is handled by the SolEmn emulation layer in Sparc-Sulima. Threaded programs (e.g. using the pthreads Posix thread libraries) ultimately access thread-related operating system services via the Solaris _lwp (lightweight process) trap instructions. The limitation of the current implementation is that, at any time, there can only be one active thread per CPU. Such a limitation has been in all emulation-based computer simulators in the literature also.

The aim of this project is to extend the emulation of the _lwp trap instructions to remove this limitation. It will require implementing thread switching, which can occur by time-outs, and also by the _lwp_suspend and yield system calls. The current thread context save/restore mechanisms will also have to be extended. If successful, the work should be publishable (along with the existing work on threads emulation).

An extension of this project, more implementation oriented, would be to simulate, rather then emulate the _lwp traps. This would involve writing a mini-kernel that would perform an implementation of lightweight threads. Now, when Sparc-Sulima executes a threaded program and encounters an _lwp trap, it will starts executing the kernel, just like on a real machine. Note that the same underlying algorithms of the emulation mode can be used; however, some considerations must now be given to the `locking' of shared data structures. The advantage of this is that kernel-level (notably memory system-related) overheads can now be recorded by the simulator. Thus, the simulator will be able to predict more accurately the threaded program's overheads.

An extension of this project, more research-oriented, would be to investigate which thread scheduling heuristics best approximate real machine behavior.

This project is related to the project Porting a SPARC-Solaris simulator to x86-Linux. It is part of the CC-NUMA Project. It has the potential for publishable work.

References


Implementation project proposal:
Porting a SPARC-Solaris simulator to x86-Linux

Description

Simulation is playing an increasingly important role in the design and evaluation of current and future high performance computer systems. Sparc-Sulima is an open-source UltraSPARC SMP simulator developed by the CC-NUMA Project. Sparc-Sulima simulates an UltraSPARC processor by interpreting each instruction. It has timing models which predict an UltraSPARC III Cu shared memory processor performance to a high degree of accuracy.

Currently, Sparc-Sulima can simulate an unmodified Solaris (UltraSPARC) binary (executable program). When the program executes a system call trap instruction, the corresponding operating system effects are `emulated' (the simulated computer state is updated directly, rather than via simulating a lengthy and complex sequence of kernel-level instructions), and simulated execution then resumes. This mechanism is handled by the SolEmn emulation layer in Sparc-Sulima. For example, threaded programs (e.g. using the pthreads Posix thread libraries) ultimately access thread-related operating system services via the Solaris _lwp (lightweight process) trap instructions.

A limitation of the current implementation of the SolEmn emulation mode is that it must be run on a Sparc-Solaris host. This limits greatly where the simulator can be used. Note however that simpler and more limited emulation modes have been ported to the x86-Linux; these modes can only run specially (and statically) linked binaries - and this excludes threaded programs!

The aim of this project is to `port' the Sparc-Sulima to the most important platform - x86/Linux! The majority of work will be in SolEmn: the emulation of Solaris system calls using Linux system calls. Since, there is not always a one-to-one relation, this is not trivial. The strategy would be to gradually extend the number of ported calls and test with statically-programs of increasing functionality. Dynamically-linked programs will be more challenging, requiring the mmap system call among others.

This project is related to the project Extended Threads Emulation in an SMP Computer Simulator. It is part of the CC-NUMA Project.

References


Implementation project proposal:
An Improved GUI for the PeANUt Computer Simulator

Description

he ANU has long used the PeANUt virtual computer to teach introductory computer systems concepts. Currently, it is used in the course COMP2300. PeANUt has a manual and simulator software which runs Linux. A crucial part of the software is the `visible machine' simulator, which is a GUI displaying how registers and memory contents change as a program executes.

The `visible machine' simulator was written using the Tcl, Tk, and Tix libraries. It has recently been fixed to run on Linux, but a port to Windows has not yet been achieved. The main problem is that support for Tcl, Tk and Tix is poor. For the current students, this is inconvenient as they are not able to set up the software on their home machines (even if it is a Linux machine, as installation of the above libraries is non-trivial).

This project will firstly rewrite the GUI part of the `visible machine' simulator using more contemporary infrastructure, such as GTK or Java / Python. For COMP6703/8770/80, an HCI evaluation of the current tool should be performed, and the interface for the new machine will be redesigned accordingly (and re-evaluated. For COMP8790, the process aspects of the project will be emphasized.

Last Modified: Peter Strazdins, July 2008