The parallel multiply program matmultest.c is invoked as follows:
mpirun -np p ./matmultest [-p] [-t] [-x px] [-r r] M N K
The program uses a standard BLAS library to perform the (fast) local matrix multiply, in this case supplied by the Atlas library. For this session, you need to load the corresponding module. Then try running, for example:
mpirun -np 2 ./matmultest -x 2 8 8 4
(do not use the -t option at this stage). Try using
the -p option to see what the matrices look like. You
should find that the program normally works; however, if the K
parameter is not a multiple of both px and py, it will
fail on the MPI_Allgather() calls. This is because the code
assumes that K is a multiple of px and py,
in which case the local matrix sizes on process (idy,idx),
kA = G2L(K, idx, px) (and kB = G2L(K, idy, py)), will be
constant across process rows (columns).
G2L(N, id, p) is a macro defined in distlinaux.h;
it computes the local length for a global length N on process
id of p processes.
Your task is to rectify this problem! To do so, you may use MPI_Allgatherv(), after appropriately initializing the size and offset arrays for the respective calls:

int sizeAs[px], offsAs[px], sizeBs[py], offsBs[py];

Hint: to generate the values for sizeAs, call G2L() with differing values for the 2nd parameter. Note that these results will need to be scaled by the other local matrix dimension (m in this case) to get the required size. The function calcOffs() can then be used to compute the values for offsAs.
K=16   # assumed starting value; K must be set before the loop
while [ $K -le 256 ] ; do mpirun -np 1 ./matmultest -r 10 1000 1000 $K; K=$(($K + 16)); done
Now run the batch file. What kind of speedups do you observe? Does repeating the multiply (reps > 1) give you better performance? Does a square process configuration out-perform a linear one? Is the transpose variant of the algorithm faster for the M x N x N multiplies? And are there any differences between these behaviors on a single node (p=16) and on 4 nodes (p=64)?
Time permitting, try using bigger values of $N and $M (say 4000).