ANU Computer Science Technical Reports
TR-CS-98-10
Peter Strazdins.
Optimal load balancing techniques for block-cyclic decompositions for
matrix factorization.
September 1998.
[POSTSCRIPT (164856 bytes)] [PDF (338101 bytes)] [EPrints archive]
Abstract: In this paper, we present a new load
balancing technique, called panel scattering, which is generally applicable
for parallel block-partitioned dense linear algebra algorithms, such as
matrix factorization. Here, the panels formed in such computation are divided
across their length, and evenly (re-)distributed among all processors. It is
shown how this technique can be efficiently implemented for the general
block-cyclic matrix distribution, requiring only the collective communication
primitives that required for block-cyclic parallel BLAS. In most situations,
panel scattering yields optimal load balance and cell computation speed
across all stages of the computation. It has also advantages in naturally
yielding good memory access patterns. Compared with traditional methods
which minimize communication costs at the expense of load balance, it has a
small (in some situations negative) increase in communication volume costs.
It however incurs extra communication startup costs, but only by a factor not
exceeding 2. To maximize load balance and minimize the cost of panel
re-distribution, storage block sizes should be kept small; furthermore, in
many situations of interest, there will be no significant communication
startup penalty for doing so.
Results will be given on the Fujitsu AP+
parallel computer, which will compare the performance of panel scattering
with previously established methods, for LU, LLT and QR factorization. These
are consistent with a detailed performance model for LU factorization for
each method that is developed here.
Technical Reports <Technical-DOT-Reports-AT-cs-DOT-anu.edu.au>
Last modified: Tue May 31 12:56:00 EST 2011