ANU Computer Science Technical Reports

TR-CS-99-01


Peter E. Strazdins.
A dense complex symmetric indefinite solver for the Fujitsu AP3000.
May 1999.

[POSTSCRIPT (169854 bytes)] [PDF (325008 bytes)] [EPrints archive]


Abstract: This paper describes the design, implementation and performance of a parallel direct dense symmetric-indefinite solver routine. Such a solver is required for the large complex systems arising from electro-magnetic field analysis, such as are generated from the AccuField application. The primary target architecture for the solver is the Fujitsu AP3000, a distributed memory machine based on the UltraSPARC processor.

The routine is written entirely in terms of the DBLAS Distributed Library, recently extended for complex precision. It uses the Bunch-Kaufman diagonal pivoting method and is based on the LAPACK algorithm, with several modifications required for efficient parallel implementation and one modification to reduce the amount of symmetric pivoting. Currently the routine uses a standard BLAS computational interface and can use either the MPI, BLACS or VPPLib communication interfaces (the latter is only available under the APruntime V2.0 system for the AP3000).

The routine out-performs its equivalent LAPACK routine zsysv() by 14% when run on a 300 MHz UltraSPARC processor for a matrix of order 1601 and a single right hand size. Comparing from the zsysv() from Sun Performance Library 1.2, the overall speed gain is 55%, by the use of faster BLAS kernels recently developed at ANU. Thus a speed of 436 (double precision) MFLOPs is possible, with an execution time of 12.5s.

Using run-time settable parameters, the routine can use any logical P× Q processor grid, any (square) storage block size r and any algorithmic block size \omega. This enables performance tuning via trading off load balance and communication penalties, the latter being relatively higher than for LU or LLT solvers. For a matrix of order 10000 on a 16-node AP3000, best performance was achieved with P=Q=4, r=\omega=44 with an execution time of 254s. This represents a sustained speed of 5.2 GFLOPs and a parallel speedup of 12. The main obstacle to a higher speedup on the AP3000 is communication volume overheads.

Future work includes improving the effective bandwidth of various communication primitives, by directly manipulating the AP3000 SDRAM and KMEM message buffering memory, which can be accessed via low-level routines of the APruntime V2.0 run-time system. Also to be investigated is an enhanced diagonal pivoting algorithm which has a `lookahead' over a block of columns, which may yield both computational and communication advantages, enabling the solver to truly approach LLT speed. This can occur if a (very) high fraction of the diagonal elements of the matrix are large (but the matrix is not necessarily positive definite), as may be the case with the AccuField matrices.


Technical Reports <Technical-DOT-Reports-AT-cs-DOT-anu.edu.au>
Last modified: Tue May 31 12:56:00 EST 2011