COMP4300/6430:
Parallel Systems

Semester 1 2013

Assignment 1

Distributed Matrix Multiplication Using MPI

Deadline: 13:00 on Monday 15 April


(Please report any errors, omissions and ambiguities to the course lecturer)
(Amendments to this document after its release in non-draft form will be marked in blue)

This assignment is worth 20% of your total course mark. It will be marked out of 40 for COMP4300 students, and out of 50 for COMP6430 students.

Submission

This assignment must be submitted electronically. This can be done using the following commands (according to your enrolment):

Objectives

Description

Setup

The file ass1.zip contains templates for the files that you will submit. It also contains an example test harness, harness.c, and a Makefile which should be used to compile the programs. Copy these files into a local directory before you begin work on your programs. You can compile the system with the simple command make.

The test harness calls a number of functions in turn to perform distributed matrix multiplication. After each call, it compares the resulting matrix C with the result of a local DGEMM. The functions are:

Your task is to implement these functions.

You can compile the harness code using the Makefile provided. On the NCI system, we will use the Intel Math Kernel Library implementation of the BLAS and LAPACK operations. On XE, load MKL as follows:

module load intel-mkl/10.3.0

The provided ass1.c contains a naïve implementation of matrix multiplication as follows:

    /* Naive triple loop: C = A * B, where A(i,j), B(i,j) and C(i,j)
       are macros indexing the column-major arrays. */
    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            C(i, j) = 0.0;
            for (l = 0; l < k; l++) {
                C(i, j) += A(i, l) * B(l, j);
            }
        }
    }

Measure the performance of this code using MPI_Wtime. How does it compare to an identical multiplication performed using BLAS DGEMM?
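As a starting point, the timing can bracket each multiply with MPI_Wtime. The sketch below assumes a wrapper naive_multiply around the triple loop above (that name is made up here, not part of the harness) and uses MKL's CBLAS interface; treat it as illustrative.

    #include <stdio.h>
    #include <mpi.h>
    #include <mkl.h>    /* CBLAS interface to MKL's dgemm */

    /* Hypothetical wrapper around the triple loop shown above. */
    void naive_multiply(int m, int n, int k, double *a, double *b, double *c);

    void time_both(int m, int n, int k, double *a, double *b, double *c)
    {
        double t0 = MPI_Wtime();
        naive_multiply(m, n, k, a, b, c);
        double t1 = MPI_Wtime();

        /* Same product via BLAS: C = 1.0*A*B + 0.0*C, column-major. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, a, m, b, k, 0.0, c, m);
        double t2 = MPI_Wtime();

        printf("naive: %.3f s, dgemm: %.3f s\n", t1 - t0, t2 - t1);
    }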

Section 1: Replicated Data

  1. Implement two different formulations of distributed dense matrix multiplication C = A × B using replicated data. Options include broadcasting an entire matrix, sending entire rows or columns of a matrix to each process, an inner- or outer-product formulation, and so on (one possibility is sketched after this list).
    Each process computes some portion of the C matrix. Data must be moved from process 0 to each process so that each process holds all the data it requires to compute its portion of C.
    You must implement the following C functions:
    int mult_replicated(int m, int n, int k, double* a, double* b, double* c);
    int mult_replicated2(int m, int n, int k, double* a, double* b, double* c);
    

    The test harness program calls these functions. The inputs are:

    m, n, k: the matrix dimensions
    a:       the matrix A at process 0, dimensions m × k (i.e. m rows by k columns), in column-major format
    b:       the matrix B at process 0, dimensions k × n
    c:       the matrix C at process 0, dimensions m × n (uninitialised)

    Your code should distribute data, evaluate part of the matrix multiplication at each process, and finally combine the results at process 0. Multiplication of subblocks of the matrix may be performed by a call to the BLAS subroutine dgemm.
    Your code does not need to support alpha or beta scaling or transposed matrices.

  2. Model the computation and communication time of both of your algorithms.
    Show the costs separately for each phase of the computation. Is each algorithm communication-limited or computation-limited, or is there a transition point?
  3. Measure the performance of your implementation on XE.
    Enhance the harness program to time a number of calls to mult_replicated and mult_replicated2. Address the following points:
  4. Measure strong scaling up to 32 processes for a problem size of N=4000. How does this compare with your cost model?
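For concreteness, here is a minimal sketch of one replicated-data formulation: broadcast all of A, scatter contiguous column blocks of B (contiguous because the storage is column-major), multiply locally with dgemm, and gather the column blocks of C back to process 0. It assumes n is divisible by the number of processes and omits the error checking required elsewhere in this document; it is one possibility, not the required design. For the cost model, the usual alpha-beta model (time α + βw to send w words) applied to each broadcast, scatter and gather phase is a reasonable starting point.

    #include <stdlib.h>
    #include <mpi.h>
    #include <mkl.h>

    /* Sketch only: matches the mult_replicated prototype, but a real
       implementation must handle n % nprocs != 0 and check MPI return
       codes. */
    int mult_replicated_sketch(int m, int n, int k,
                               double *a, double *b, double *c)
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int ncols = n / nprocs;    /* columns of C per process */
        double *blocal = malloc((size_t)k * ncols * sizeof(double));
        double *clocal = malloc((size_t)m * ncols * sizeof(double));

        /* Non-root processes need their own copy of A. */
        double *alocal = (rank == 0) ? a
                                     : malloc((size_t)m * k * sizeof(double));

        MPI_Bcast(alocal, m * k, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Scatter(b, k * ncols, MPI_DOUBLE,
                    blocal, k * ncols, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* C_local = A * B_local, all column-major. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, ncols, k, 1.0, alocal, m, blocal, k, 0.0, clocal, m);

        /* Column blocks of C are contiguous, so a plain gather suffices. */
        MPI_Gather(clocal, m * ncols, MPI_DOUBLE,
                   c, m * ncols, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        free(blocal);
        free(clocal);
        if (rank != 0) free(alocal);
        return 0;
    }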

Section 2: SUMMA

  1. Use the provided code for the SUMMA algorithm to perform distributed matrix multiplication.
    The code for SUMMA (the function pdgemm described in the paper) is provided in the file summa.c. It assumes that each process holds a block of the A and B matrices, and computes a block of the C matrix. You must implement the following C function:
    int mult_summa(int m, int n, int k, double* a, double* b, double* c);
    

    This should distribute data to the appropriate processes and then call pdgemm with the correct parameters. This requires that you determine the dimensions of the local blocks of A, B and C for each process, and create the row and column communicators required by SUMMA (one way to build these is sketched after this list). You may choose whatever default value you think is sensible for the panel size nb.

  2. Model the computation and communication time for the algorithm provided.
  3. Measure performance on XE as described in section 1. How does this compare with the cost model?
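As an illustration of the communicator setup, MPI_Comm_split can build both communicators in a few lines. The grid shape and row-major rank ordering below are assumptions; match them to whatever summa.c expects.

    #include <assert.h>
    #include <mpi.h>

    /* Build row and column communicators for a pr x pc process grid,
       assuming ranks are laid out row-major across the grid. */
    void make_grid_comms(int pr, int pc,
                         MPI_Comm *row_comm, MPI_Comm *col_comm)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        assert(rank < pr * pc);
        int myrow = rank / pc;    /* this process's grid row */
        int mycol = rank % pc;    /* this process's grid column */

        /* Same colour => same communicator; the key orders ranks in it. */
        MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, col_comm);
    }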

Section 3: SUMMA, non-pipelined

  1. Copy the code from pdgemm to pdgemm_nonpipelined and modify it so that it uses a tree broadcast (e.g. MPI_Bcast) instead of pipelined communication (the shape of the change is sketched after this list).
  2. Does this change the cost model? If so, how?
  3. Measure performance on XE and compare against the provided code.
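The flavour of the change is sketched below; the names (panel, panel_size, owner, row_comm) are placeholders rather than identifiers from summa.c. Where the pipelined version forwards each panel along the ring with point-to-point sends and receives, the non-pipelined version has the owning process broadcast the panel to its whole row (or column) in one collective call:

    #include <mpi.h>

    /* Non-pipelined panel distribution: every process in the row calls
       this; 'owner' is the rank within row_comm that holds the current
       panel, and MPI_Bcast distributes it using a tree internally. */
    void broadcast_panel(double *panel, int panel_size, int owner,
                         MPI_Comm row_comm)
    {
        MPI_Bcast(panel, panel_size, MPI_DOUBLE, owner, row_comm);
    }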

Section 4 (COMP6430 students only):

Read the original paper on PUMMA (Choi et al. 1994). Summarise and compare the two algorithms (PUMMA and SUMMA). Address the following points:

Requirements

Your code must be written using C and MPI, with calls to BLAS and LAPACK subroutines and system libraries. Calls to other libraries are not permitted.

You are required to submit the following files:

Your code should be written with good programming style (and so should not produce any warnings when compiled with the -Wall option!). It should be well commented, so as to enable the reader (this includes the tutor/marker!) to understand how it works. In particular:

  - Identifiers should be meaningful and (unless their role is trivial) commented where declared.
  - Constants should be named rather than hard-coded.
  - Defensive programming should be used: check every function call that can return an error status; on failure, report it with the system error function perror() and exit with a non-zero status. Use assert() to perform other kinds of checks.
  - Free any malloc'd data before (normal) exit; this permits the mcheck library to detect any buffer overwrites.
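The fragment below illustrates this style (the buffer and error message are invented for the example):

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define VECTOR_LENGTH 1000    /* named constant, not a magic number */

    int main(void)
    {
        double *buf = malloc(VECTOR_LENGTH * sizeof(double));
        if (buf == NULL) {                /* check calls that can fail */
            perror("allocating work buffer");
            exit(EXIT_FAILURE);           /* non-zero exit status */
        }
        buf[0] = 1.0;
        assert(buf[0] > 0.0);             /* other sanity checks via assert() */
        /* ... use buf ... */
        free(buf);                        /* free before normal exit */
        return EXIT_SUCCESS;
    }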

When running jobs on the XE system, please use the normal queue rather than the express queue, as the express queue is charged at three times the standard rate in usage units. Please be mindful that you share the allocation with all other students in this course!

Marking Scheme

Section                      Implementation   Model   Measurement   Total
Replicated data                          10       5             5      20
SUMMA                                     6       3             3      12
SUMMA with tree broadcast                 4       2             2       8
Total (COMP4300)                         20      10            10      40

Section 4 (COMP6430 only): Report, 10 marks
Total (COMP6430): 50

Extensions

See the course Administration page. Late penalties are as follows:
How late                             Penalty (from 40 marks)
less than 1 hour                     -1
between 1 and 6 hours                -2
between 6 and 24 hours (1 day)       -5
between 24 and 48 hours (2 days)     -10
between 48 and 72 hours (3 days)     -20
more than 72 hours (4 days)          -40 (forget it!)

Note: late penalties apply even if you submit just one file after the deadline.