A Parallel Iterative Linear System Solver with Dynamic Load Balancing

This thesis describes the design and implementation of a parallel iterative linear system solver - called PAISS (Parallel Adaptive Iterative linear System Solver) - for distributed memory multicomputers and workstation clusters. PAISS provides heterogeneous data distribution and dynamic load balancing (DLB) within a solver routine at the matrix level. Both features are new in parallel iterative linear system solving, and they are becoming increasingly important as parallel computing moves to hardware platforms that are no longer homogeneous and static. With the growing popularity of workstation clusters in scientific computing, flexible parallel programs that adjust to the computational performance available on each processor are clearly needed.
    On a modern workstation the computational performance available to a single process depends on jobs running concurrently and on other users logging on and off. The same holds for modern distributed memory multicomputers, like the IBM SP-2, as they are often built like workstation clusters connected by a high-speed interconnection network. Users can utilize a single processor, a subset, or all available processors, running serial and/or parallel programs in interactive or batch mode. Load balancing on such machines mostly consists of a load distribution system that starts new jobs on the least loaded available processors; a redistribution of jobs or work to processors that become idle is usually not done.

The presented solver PAISS initially distributes data (i.e. the coefficient matrix and the vectors used by an iterative method) heterogeneously onto the participating processors according to their currently available computational performance, which is measured with a micro-benchmark. As the load - and thus the computational performance available to an individual process - may change dynamically, a redistribution of data may be necessary to achieve a balanced load and thus a minimal run time. PAISS balances the load by removing data (i.e. rows of the coefficient matrix and the corresponding vector elements) from overloaded processors and distributing it to others.
    Appropriate new matrix data structures - for both dense and unstructured sparse matrices - have been designed and included in PAISS. They allow a dynamic redistribution at run time while still achieving good floating-point performance (compared to the static matrix data structures implemented in other parallel iterative solvers). Timing experiments on different hardware platforms have shown that the newly designed dynamic matrix data structures are almost as efficient as static ones.
    A basic DLB algorithm has been developed that redistributes matrix rows and the corresponding vector elements by removing them from overloaded processors and allocating them to others. As the overhead introduced by DLB in an iterative solver routine consists mainly of communicating matrix rows, considerable effort has been put into reducing this overhead. Six different methods have been implemented to redistribute matrix rows. The first performs the redistribution with blocking communication, while the second and third allow the DLB communication to overlap with useful computation (e.g. a matrix-vector multiplication). In the fourth method, only the most overloaded processor is permitted to send data to other processors, which reduces communication (this method is only useful on shared networks like Ethernet or Token Ring). Finally, two methods operate with overlapping matrix regions, i.e. matrix rows that are stored redundantly on several processors (but are used on only one processor within a matrix-vector multiplication). DLB is then performed logically by adjusting the set of used rows within the bounds of the rows stored on a processor.
    An essential part of a DLB algorithm is the measurement of all processors' loads. In PAISS, the wall clock time spent in the computational routines is measured and exchanged among the processors in each iteration. This approach is completely system-independent and introduces almost no overhead. The DLB algorithm then attempts to balance these computational times on all processors by distributing data appropriately.
    PAISS is intended to solve large sparse (or dense) linear systems of equations with iterative methods like the Conjugate Gradient (CG) method. Iterative solver routines can be implemented using the basic routines provided by PAISS. Preconditioning routines that only use the given arithmetic operations (i.e. matrix-vector multiplication, vector additions (DAXPY), and dot products) can also be implemented easily. As an example, the CG method with diagonal and polynomial preconditioning has been implemented and tested. A PAISS solver routine can be run in serial or parallel mode without any changes to the code.

Experiments on different hardware platforms have shown the suitability of the DLB approach as applied within PAISS. Tests have been performed with dense, banded, and sparse matrices from the Harwell-Boeing collection. Compared with the state-of-the-art parallel iterative solver package Aztec, PAISS achieves quite good results. On an idle homogeneous platform, similar run times have been measured for both packages, whereas on a heterogeneous or dynamic platform PAISS often achieves better results than Aztec.
    The PAISS implementation is written in ANSI C and uses MPI for communication. A first prototype runs on SUN workstations. Additionally, it has been ported to an IBM SP-2 and to a Swiss SCS Gigabooster.



Peter Christen
April 5, 1999