Hands-On Session Parallelization Strategies #4:
Matrix Multiply

Objective: To understand 2-dimensional data-partitioning strategies and to deepen knowledge of communication patterns, using matrix multiplication as the example.
The code for this exercise can be found here. Instructions on how to log into the remote machine and how to download the source code to your working directory (using wget) can be found here.

The parallel multiply program matmultest.c is invoked as follows:

It performs r repetitions of an M x N x K matrix multiply C+=AB on p=py*px processes. If -t is given, it performs C+=AB^T instead. The -p option can be used to print the arrays (for debugging!). The program generates matrices such that global element (i,j) has the value 2*i+j. This is useful because the original matrices (and their products) have a regular, predictable pattern, which aids debugging; it also enables very fast error checking. To perform the actual multiply, the program calls the function matMultNN() in matmult.c.
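The generation scheme above can be sketched as follows. This is an illustrative stand-in, not the code from matmultest.c: the helper names fillLocalBlock, globalElem, i0 and j0 are invented here, and the row-major layout is an assumption.

```c
#include <assert.h>

/* Hypothetical helper mirroring the generator described above:
   global element (i, j) of A and B has the value 2*i + j. */
static long globalElem(long i, long j) {
    return 2 * i + j;
}

/* Fill an mLoc x nLoc local block whose top-left corner sits at
   global position (i0, j0); row-major storage, leading dimension nLoc.
   Names and layout are illustrative, not taken from matmultest.c. */
static void fillLocalBlock(long *a, int mLoc, int nLoc, long i0, long j0) {
    for (int i = 0; i < mLoc; i++)
        for (int j = 0; j < nLoc; j++)
            a[i * nLoc + j] = globalElem(i0 + i, j0 + j);
}
```

Because every element is determined by its global indices alone, each process can fill its block independently, and a checker can recompute any expected entry in O(1) without a reference multiply.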

The program uses a standard BLAS library to perform the (fast) local matrix multiply, in this case supplied by the ATLAS library. For this session, you need to load the corresponding module:

  1. Compile and run the program interactively for small matrix sizes and p, e.g. mpirun -np 2 ./matmultest -x 2 8 8 4 (do not use the -t option at this stage). Try the -p option to see what the matrices look like. You should find that the program normally works; however, if the K parameter is not a multiple of both px and py, it fails in the MPI_Allgather() calls. This is because the code assumes that K is a multiple of px and py, in which case the local matrix sizes on process (idy,idx), kA = G2L(K, idx, px) and kB = G2L(K, idy, py), are constant across process rows and columns, respectively. G2L(N, id, p) is a macro defined in distlinaux.h; it computes the local length for a global length N on process id of p processes.

    Your task is to rectify this problem! To do so, you may use MPI_Allgatherv(), after appropriately initializing the size and offset arrays for their respective calls:

     
      int sizeAs[px], offsAs[px], sizeBs[py], offsBs[py];
    
    Hint: to generate the values for sizeAs, call G2L() with differing values for the 2nd parameter. Note that these results must be scaled by the other local matrix dimension (m in this case) to obtain the required sizes. The function calcOffs() can then be used to compute the values for offsAs.
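The hint above can be sketched as follows. Two caveats: the G2L body shown assumes the common block distribution in which the first N % p processes get one extra element (check the real macro in distlinaux.h), and myCalcOffs is a stand-in for the provided calcOffs(), whose exact signature is not shown in this handout.

```c
#include <assert.h>

/* Assumed block distribution; the real G2L lives in distlinaux.h. */
#define G2L(N, id, p) ((N) / (p) + ((id) < (N) % (p) ? 1 : 0))

/* Stand-in for the provided calcOffs(): prefix sums of the sizes. */
static void myCalcOffs(int *offs, const int *sizes, int p) {
    offs[0] = 0;
    for (int i = 1; i < p; i++)
        offs[i] = offs[i - 1] + sizes[i - 1];
}

/* Initialize sizeAs/offsAs for the A gather: vary the 2nd G2L parameter
   over the process row, scaling each local length by m. */
static void initAllgathervArrays(int K, int px, int m,
                                 int *sizeAs, int *offsAs) {
    for (int idx = 0; idx < px; idx++)
        sizeAs[idx] = m * G2L(K, idx, px);
    myCalcOffs(offsAs, sizeAs, px);
    /* The collective then becomes something like:
       MPI_Allgatherv(Aloc, m * kA, MPI_DOUBLE,
                      Abuf, sizeAs, offsAs, MPI_DOUBLE, rowComm); */
}
```

The sizeBs/offsBs arrays are set up the same way, varying idy over py and scaling by the other local dimension.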

  2. Find the (near-enough) optimal value of K for serial computation; you can use the supplied script to do this. Then initialize the variable K in batch_matmult accordingly.

    Now run the batch file. What kind of speedups do you observe? Does repeating the multiply (reps > 1) give better performance? Does a square process configuration outperform a linear one? Is the transpose variant of the algorithm faster for the M x N x N multiplies? And are there any differences in these respects on a single node (p=16) versus on 4 nodes (p=64)?

    Time permitting, try using a bigger $N and $M (say 4000).
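The serial K scan in step 2 can be sketched as a small shell loop. This is a dry-run sketch, not the supplied script: M, N and the candidate K list are illustrative, and the argument order after -x ("px M N K") is an assumption inferred from the interactive example above, so check matmultest's usage message before launching for real.

```shell
# Print (dry run) one serial mpirun command per candidate K; pipe the
# output to sh, or replace echo with the real launch, to time the runs.
scanK() {
    M=$1; N=$2; shift 2
    for K in "$@"; do
        echo "mpirun -np 1 ./matmultest -x 1 $M $N $K"
    done
}

scanK 1000 1000 64 128 256 512
```

Timing each printed command (e.g. with time) and picking the fastest K gives the value to place in batch_matmult.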