Coursework (including Honours) Project Proposals by Peter Strazdins

The following is a list of project proposals by Dr Peter Strazdins that are suitable for coursework projects (3rd year BIT, 4th year BSEng, Honours, Masters), valid for semester 1/2 2011:

Multicore Software Engineering (research/implementation)
This project will investigate experiences and methodologies in refactoring a non-trivial existing software application for efficient implementation on multicore processors. For example, one of the papers linked from the COMP8320 tutorial 4 page found that parallelizing the open source code base BZip2 proved non-trivial, even though the main computation involved is `embarrassingly parallel'.
This project will undertake such an exercise in refactoring, detailing and analyzing the experiences generated, and comparing them with the literature. Time permitting, it will also investigate and evaluate current software engineering methodologies that are available for parallel programs (such as the PADL methodology from Professional Multicore Programming: Design and Implementation for C++ Developers, Cameron Hughes and Tracey Hughes, Wiley, ISBN: 978-0-470-28962-4, 2008).
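
As a hedged illustration of the target of such refactoring, the sketch below parallelizes block-oriented compression over independent blocks using POSIX threads; compress_block is a hypothetical stand-in for BZip2's per-block compressor, not its actual code.

  #include <pthread.h>

  #define NBLOCKS  64   /* independent compression blocks in the input */
  #define NTHREADS  8

  /* Hypothetical stand-in for BZip2's per-block compressor (the real
     code runs BWT, move-to-front and Huffman stages on the block). */
  static void compress_block(int block_id)
  {
      (void)block_id;
  }

  static int next_block = 0;   /* shared work counter */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  /* Each worker repeatedly claims the next unprocessed block; blocks are
     independent, so no synchronization beyond the counter is needed. */
  static void *worker(void *arg)
  {
      (void)arg;
      for (;;) {
          pthread_mutex_lock(&lock);
          int b = next_block++;
          pthread_mutex_unlock(&lock);
          if (b >= NBLOCKS)
              return NULL;
          compress_block(b);
      }
  }

  int main(void)
  {
      pthread_t tid[NTHREADS];
      for (int i = 0; i < NTHREADS; i++)
          pthread_create(&tid[i], NULL, worker, NULL);
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(tid[i], NULL);
      return 0;
  }

The paper referenced above found that the refactoring difficulty lies not in this skeleton, but in untangling the sequential code's shared state; documenting that gap is part of the project.
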
Taken by Sayan Bhattacharyya, s2 2010

Exploring the T2's Cryptographic Capabilities (research)
The T2 is primarily designed for highly concurrent applications such as database and web servers, where the combination of chip multi-processing and hardware multithreading enables it to achieve a high degree of throughput. However, the chip also has hardware support for cryptography, with one Modular Arithmetic Unit and one Cipher/Hash Unit per processor core. There are claims of up to 100 times better performance on cryptographic workloads compared with rival processors.
This project will investigate how these cryptographic functions are implemented on the T2, and evaluate performance for both simple benchmarks and applications that are intensive in cryptographic calculations.
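
As a starting point for the simple benchmarks, a minimal throughput measurement might look like the sketch below; it assumes the OpenSSL EVP interface, and whether the T2's hardware units are actually engaged depends on the platform's crypto provider configuration, which the project would need to verify.

  #include <openssl/evp.h>
  #include <stdio.h>
  #include <time.h>

  #define BUF_SIZE (1 << 20)   /* 1 MiB per encryption pass */
  #define PASSES   256

  int main(void)
  {
      static unsigned char in[BUF_SIZE], out[BUF_SIZE + 16];
      unsigned char key[16] = {0}, iv[16] = {0};
      int outl;

      EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
      clock_t t0 = clock();
      for (int p = 0; p < PASSES; p++) {
          /* Encrypt the buffer; on a T2 this may be offloaded to the
             per-core Cipher/Hash Unit, depending on the crypto provider. */
          EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv);
          EVP_EncryptUpdate(ctx, out, &outl, in, BUF_SIZE);
          EVP_EncryptFinal_ex(ctx, out + outl, &outl);
      }
      double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
      printf("AES-128-CBC: %.1f MB/s\n",
             PASSES * (double)BUF_SIZE / (1e6 * secs));
      EVP_CIPHER_CTX_free(ctx);
      return 0;
  }
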
Taken by Cody Christopher, s2 2010

Basic Linear Algebra Kernels and LINPACK for the T2 (research/implementation)
The Niagara or CoolThreads (T) series of processors was motivated by commercial applications such as web, database and eCommerce servers. Thus, these processors achieve high throughput on workloads with many execution threads, which are integer, memory and I/O intensive. However, the T2 also comes equipped with a floating point unit on each core. The question arises: how effective is it for (parallel) numerical applications, and how can it best be programmed for these?
This project will investigate how to program the matrix-matrix multiply kernel from the BLAS library, dgemm(), for the T2, optimizing it for both single core and multicore/thread performance. It will also evaluate its performance. Since dgemm() forms the computational heart of almost all dense linear algebra computations, it is surprising that there seems to be very little work done on this topic so far.
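
As a baseline for such a kernel, a plain blocked C implementation might look like the following sketch (row-major rather than the BLAS's column-major layout, for clarity; the block size NB is an assumed tuning parameter, not a recommended value).

  #define N  512
  #define NB 32   /* block size: a key tuning parameter on the T2 */

  /* Row-major blocked matrix multiply, C += A * B. A real dgemm()
     kernel would add register blocking and prefetching tuned to the
     T2's per-core pipelines and shared L2 cache. */
  void dgemm_blocked(const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < N; ii += NB)
          for (int kk = 0; kk < N; kk += NB)
              for (int jj = 0; jj < N; jj += NB)
                  for (int i = ii; i < ii + NB; i++)
                      for (int k = kk; k < kk + NB; k++) {
                          double a = A[i * N + k];
                          for (int j = jj; j < jj + NB; j++)
                              C[i * N + j] += a * B[k * N + j];
                      }
  }

In practice, the loop ordering and the choice of NB would themselves be tuning targets, as would the assignment of blocks to hardware threads.
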
An extension/alternate project would be to investigate the performance of multithreaded LINPACK benchmarks. While their performance is dominated by dgemm(), load balancing issues turn out to be non-trivial, due to the nature of the T2.

Evaluation of the T2's Multithreading Performance (research)
The T2 not only has 8 computational cores, but each core can concurrently support 8 computational threads. The question arises: how much speedup can you get from the multiple cores, and how much more can you get out of the hardware threading? Furthermore, what is the best way to synchronize software threads, running across different cores and hardware threads? In what ways is such a processor different from a traditional shared memory processor?
This project will build on a previous project by Hugh Blemings based on the OpenPower 720. It involves evaluating the T2 on memory- and compute-intensive parallel benchmarks and synchronization algorithms.
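
A minimal sketch of such an evaluation, assuming a purely compute-bound kernel and POSIX threads (thread-to-core binding, which matters on the T2, is omitted here):

  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  #define WORK (1L << 26)   /* total floating-point iterations */

  static int nthreads;

  /* Compute-bound kernel: each thread does an equal share, with no
     memory traffic, to isolate pipeline-sharing effects. */
  static void *spin(void *arg)
  {
      volatile double x = 1.0;
      for (long i = 0; i < WORK / nthreads; i++)
          x = x * 1.000001 + 0.000001;
      (void)arg;
      return NULL;
  }

  int main(void)
  {
      /* 1-8 threads can exercise separate cores (if bound accordingly);
         9-64 threads must share cores via hardware multithreading. */
      for (nthreads = 1; nthreads <= 64; nthreads *= 2) {
          pthread_t tid[64];
          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (int i = 0; i < nthreads; i++)
              pthread_create(&tid[i], NULL, spin, NULL);
          for (int i = 0; i < nthreads; i++)
              pthread_join(tid[i], NULL);
          clock_gettime(CLOCK_MONOTONIC, &t1);
          double secs = (t1.tv_sec - t0.tv_sec)
                      + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
          printf("%2d threads: %.3f s\n", nthreads, secs);
      }
      return 0;
  }

Memory-intensive variants and the synchronization algorithms would replace the spin kernel, keeping the same harness.
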

References

From the T2 wiki, you can get to most things on the T2. There is plenty of information and links on the COMP8320 web page.


Research/implementation project proposal:
Automatic Tuning of the High Performance Linpack Benchmark (followup)

Description

Supercomputers today are large-scale parallel processors, and the Top 500 list is the most definitive ranking of the world's fastest machines. While the status of this list is contentious, it is based on the results of a single benchmark - the Linpack benchmark - which in this case solves a (very) large dense linear system. The list is dominated by Beowulf-style cluster computers: parallel computers using a (typically Commercial-off-the-Shelf) switch-based network to communicate between the processors.

There is a publicly available open-source code for a highly optimized benchmark called High Performance Linpack (HPL). It is also highly tunable, but the tuning currently must be done by hand. This is problematic, in that it requires knowledge both of the complex algorithms used in HPL (beyond the understanding of most people who wish to use the benchmark) and of the (cluster) supercomputer itself. Furthermore, in the case of clusters, this problem is exacerbated by the fact that the cluster may be non-homogeneous (so the benchmark results may depend on which parts of the cluster are selected), and by the fact that by their nature clusters may be easily reconfigured. Thus, it is highly desirable that this tuning process be automated.

The LAPACK For Clusters (LFC) project went partway in addressing this: given p available processors, it did some work on determining the best P x Q logical processor grid, where P*Q <= p. However, automatically choosing the critical blocking parameter and selecting among a plethora of algorithm variants remain to be done.
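
A minimal sketch of the grid-selection step, under the simplifying assumption that processor usage is maximized first and squareness second (a real tuner would also weigh communication costs, which often favour P < Q):

  #include <stdio.h>

  /* Given p available processors, pick a P x Q grid with P*Q <= p and
     P <= Q, preferring grids that use more processors and, among those,
     the squarest. This mirrors the kind of heuristic LFC applied. */
  void best_grid(int p, int *bestP, int *bestQ)
  {
      int bestUsed = 0;
      *bestP = 1; *bestQ = 1;
      for (int P = 1; P * P <= p; P++) {
          int Q = p / P;             /* largest Q with P*Q <= p */
          int used = P * Q;
          if (used > bestUsed || (used == bestUsed && P > *bestP)) {
              bestUsed = used;
              *bestP = P; *bestQ = Q;
          }
      }
  }

  int main(void)
  {
      for (int p = 13; p <= 17; p++) {   /* awkward values such as primes */
          int P, Q;
          best_grid(p, &P, &Q);
          printf("p=%2d -> %d x %d (uses %d)\n", p, P, Q, P * Q);
      }
      return 0;
  }

Note that for prime p this heuristic degenerates to a 1 x p grid, which is exactly the kind of awkward case the project should examine.
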

COMP3130 students Chris Fraser, Sotirios Diamand, and Sam Rathmanner (2009) made a start on this project. They implemented search methods including random, `linear', and Nelder-Mead simplex, coded using a Python infrastructure. Optimizations such as early exit were implemented. However, the evaluation was very preliminary, and many of the interesting research issues remain unanswered.

This project will follow this up with a more comprehensive evaluation, and in particular look at challenging cases (e.g. where p is an awkward value in which to embed a P x Q grid). It will also evaluate the accuracy of early cut-off methods. From this starting point, refinements to the search methods will be investigated.

The SoCS Saratoga cluster will form a suitable experimental platform, with the NCI clusters also available for larger tests. The project has the potential for publishable work and for the production of widely used open-source software.


Research project proposal:
Parallelizing Applications on a Heterogeneous Cluster

A Beowulf-style cluster computer is a parallel computer using a Commercial-off-the-Shelf switch-based network to communicate between the processors. The ANU Beowulf cluster Bunyip is such a cluster, based on Fast (100Mb) Ethernet switches. Clusters have proved a highly cost-effective high performance computing model, and have largely displaced the traditional massively parallel computers built with expensive vendor-supplied networks.

A typical cluster such as Bunyip tends to evolve as new processors and networks (and the monies to pay for them!) become available. Thus clusters tend to become heterogeneous. Furthermore, for a particular application, there tends to be a particular cluster configuration that is optimal. In the case of a (high performance) computing service center designed to serve many such applications, a cluster may be designed to be heterogeneous from the beginning.

The efficient parallelization of applications using the Message Passing Interface (MPI) is reasonably well understood for homogeneous clusters. However, for heterogeneous clusters, which can not only have nodes with different processor architectures but also sections of nodes connected by different networks, the situation is far more challenging. In particular, differences in processor and network speeds make it difficult to utilize the faster components effectively.
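
One basic technique is static weighted partitioning: size each rank's share of the work in proportion to its measured speed, so the faster nodes receive more. The MPI sketch below illustrates this; measure_local_speed is a hypothetical calibration routine, not part of MPI or of any existing code for the Jabberwocky.

  #include <mpi.h>
  #include <stdio.h>

  /* Hypothetical calibration: time a small compute kernel and return its
     rate; a real run would use a kernel representative of the application. */
  static double measure_local_speed(void)
  {
      volatile double x = 1.0;
      double t0 = MPI_Wtime();
      for (long i = 0; i < 10000000L; i++)
          x = x * 1.0000001 + 1e-7;
      return 1e7 / (MPI_Wtime() - t0);
  }

  int main(int argc, char **argv)
  {
      const long N = 1000000;      /* total work units to distribute */
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Size each rank's share in proportion to its measured speed
         (remainder handling omitted for brevity). */
      double my_speed = measure_local_speed(), total_speed;
      MPI_Allreduce(&my_speed, &total_speed, 1, MPI_DOUBLE, MPI_SUM,
                    MPI_COMM_WORLD);
      long my_share = (long)(N * my_speed / total_speed);

      printf("rank %d of %d: relative speed %.2f -> %ld units\n",
             rank, size, my_speed / total_speed, my_share);

      MPI_Finalize();
      return 0;
  }

Static weighting like this does not address differing network speeds between sections of nodes; that is where the more interesting techniques, and this project, come in.
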

Using the Jabberwocky heterogeneous cluster, this project will investigate techniques used so far to parallelize (scientific/engineering) applications for heterogeneous clusters. The project will then make a selection of these applications and evaluate the effectiveness of these techniques on the Jabberwocky. Of particular interest is the gravitational N-body problem; a potentially efficient algorithm was prototyped in Travis Stenborg's thesis for COMP8800 (2009); however, a fully tuned algorithm and its rigorous evaluation remain to be done as a follow-up.



Implementation/research project proposal:
Extended Threads Emulation in an SMP Computer Simulator

Description

Simulation is playing an increasingly important role in the design and evaluation of current and future high performance computer systems. Sparc-Sulima is an open-source UltraSPARC SMP simulator, written in C++ and developed by the CC-NUMA Project. Sparc-Sulima simulates an UltraSPARC processor by interpreting each instruction. It has timing models which predict the performance of an UltraSPARC III Cu shared memory processor to a high degree of accuracy.

Currently, Sparc-Sulima can simulate an unmodified Solaris (UltraSPARC) binary (executable program). When that program executes a system call trap instruction, the corresponding operating system effects are `emulated' (the simulated computer state is updated directly, rather than via simulating a lengthy and complex sequence of kernel-level instructions), and simulated execution then resumes. This mechanism is handled by the SolEmn emulation layer in Sparc-Sulima. Threaded programs (e.g. using the pthreads Posix thread libraries) ultimately access thread-related operating system services via the Solaris _lwp (lightweight process) trap instructions. The limitation of the current implementation is that, at any time, there can only be one active thread per CPU. Such a limitation has been present in all emulation-based computer simulators in the literature as well.

The aim of this project is to extend the emulation of the _lwp trap instructions to remove this limitation. It will require implementing thread switching, which can occur by time-outs, and also by the _lwp_suspend and yield system calls. The current thread context save/restore mechanisms will also have to be extended. If successful, the work should be publishable (along with the existing work on threads emulation).
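
A hedged sketch of the kind of per-CPU scheduling state this might require is given below; all types, fields and function names are hypothetical, not Sparc-Sulima's actual internals.

  #include <stdint.h>

  enum lwp_state { LWP_RUNNING, LWP_RUNNABLE, LWP_SUSPENDED };

  typedef struct lwp_context {
      uint64_t regs[32];         /* saved simulated integer registers */
      uint64_t pc, npc;          /* saved simulated program counters */
      enum lwp_state state;
      struct lwp_context *next;  /* run-queue link */
  } lwp_context;

  typedef struct {
      lwp_context *current;      /* the one thread now on this CPU */
      lwp_context *head, *tail;  /* FIFO run queue of waiting threads */
      uint64_t quantum_end;      /* simulated cycle of the next time-out */
  } cpu_sched;

  static void enqueue(cpu_sched *cpu, lwp_context *t)
  {
      t->next = NULL;
      if (cpu->tail) cpu->tail->next = t; else cpu->head = t;
      cpu->tail = t;
  }

  /* Switch the simulated CPU to the next runnable thread: re-queue the
     outgoing thread (unless it suspended itself) and install the new one.
     Called on quantum expiry, _lwp_suspend of self, or yield. */
  void lwp_switch(cpu_sched *cpu)
  {
      lwp_context *in = cpu->head;
      if (in == NULL)
          return;                          /* nothing else runnable */
      cpu->head = in->next;
      if (cpu->head == NULL)
          cpu->tail = NULL;

      lwp_context *out = cpu->current;
      if (out && out->state == LWP_RUNNING) {
          out->state = LWP_RUNNABLE;
          enqueue(cpu, out);
      }
      in->state = LWP_RUNNING;
      cpu->current = in;
      /* save_context(out) / restore_context(in) would copy the simulated
         register windows and PCs here. */
  }
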

An extension of this project, more implementation-oriented, would be to simulate, rather than emulate, the _lwp traps. This would involve writing a mini-kernel implementing lightweight threads. Now, when Sparc-Sulima executes a threaded program and encounters an _lwp trap, it will start executing the kernel, just like on a real machine. Note that the same underlying algorithms of the emulation mode can be used; however, some consideration must now be given to the `locking' of shared data structures. The advantage of this is that kernel-level (notably memory system-related) overheads can now be recorded by the simulator. Thus, the simulator will be able to predict the threaded program's overheads more accurately.

An extension of this project, more research-oriented, would be to investigate which thread scheduling heuristics best approximate real machine behavior.

This project is related to the project Porting a SPARC-Solaris simulator to x86-Linux. It is part of the CC-NUMA Project. It has the potential for publishable work.


Implementation project proposal:
Porting a SPARC-Solaris simulator to x86-Linux

Description

Simulation is playing an increasingly important role in the design and evaluation of current and future high performance computer systems. Sparc-Sulima is an open-source UltraSPARC SMP simulator developed by the CC-NUMA Project. Sparc-Sulima simulates an UltraSPARC processor by interpreting each instruction. It has timing models which predict the performance of an UltraSPARC III Cu shared memory processor to a high degree of accuracy.

Currently, Sparc-Sulima can simulate an unmodified Solaris (UltraSPARC) binary (executable program). When the program executes a system call trap instruction, the corresponding operating system effects are `emulated' (the simulated computer state is updated directly, rather than via simulating a lengthy and complex sequence of kernel-level instructions), and simulated execution then resumes. This mechanism is handled by the SolEmn emulation layer in Sparc-Sulima. For example, threaded programs (e.g. using the pthreads Posix thread libraries) ultimately access thread-related operating system services via the Solaris _lwp (lightweight process) trap instructions.

A limitation of the current implementation of the SolEmn emulation mode is that it must be run on a Sparc-Solaris host. This greatly limits where the simulator can be used. Note however that simpler and more limited emulation modes have been ported to x86-Linux; these modes can only run specially (and statically) linked binaries - and this excludes threaded programs!

The aim of this project is to `port' Sparc-Sulima to the most important platform - x86/Linux! The majority of the work will be in SolEmn: the emulation of Solaris system calls using Linux system calls. Since there is not always a one-to-one correspondence, this is not trivial. The strategy would be to gradually extend the number of ported calls and test with statically-linked programs of increasing functionality. Dynamically-linked programs will be more challenging, requiring the mmap system call among others.
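
One plausible structure for this is a dispatch table from Solaris call numbers to Linux-backed handlers, sketched below; the trap numbers and handler names are purely illustrative, not the real Solaris ABI or SolEmn's actual design.

  #include <stdio.h>

  typedef struct sim_cpu sim_cpu;          /* simulated CPU state (opaque) */
  typedef int (*syscall_handler)(sim_cpu *cpu);

  /* Each handler would read arguments from the simulated registers,
     perform the equivalent Linux system call(s), and write back the
     result; stubs shown here. */
  static int emu_read(sim_cpu *cpu)  { (void)cpu; return 0; }  /* -> read(2)  */
  static int emu_write(sim_cpu *cpu) { (void)cpu; return 0; }  /* -> write(2) */
  static int emu_mmap(sim_cpu *cpu)  { (void)cpu; return 0; }  /* -> mmap(2),
                                                  after translating flags */

  #define MAX_SYSCALL 256

  static syscall_handler handlers[MAX_SYSCALL] = {
      [3]   = emu_read,    /* illustrative trap numbers only */
      [4]   = emu_write,
      [115] = emu_mmap,
  };

  /* Called when the simulated program executes a system call trap:
     look up the Solaris call number and run the Linux-backed handler. */
  int solemn_dispatch(sim_cpu *cpu, unsigned callno)
  {
      if (callno >= MAX_SYSCALL || handlers[callno] == NULL) {
          fprintf(stderr, "unimplemented Solaris syscall %u\n", callno);
          return -1;                       /* fail loudly while porting */
      }
      return handlers[callno](cpu);
  }

Failing loudly on unimplemented calls supports the gradual-extension strategy: each new test program reveals exactly which calls must be ported next.
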

This project is related to the project Extended Threads Emulation in an SMP Computer Simulator. It is part of the CC-NUMA Project.

Last Modified: Peter Strazdins, Jul 2013