COMP4300/6430:
Parallel Systems
Semester 1 2013
Assignment 1
Distributed Matrix Multiplication Using MPI
Deadline: 13:00 on Monday 15 April
(Please report any errors, omissions and ambiguities to the course lecturer.)
(Amendments to this document after its release in non-draft form will be marked in blue.)
This assignment is worth 20% of your total course mark. It will be marked out of 40 for COMP4300 students, and out of 50 for COMP6430 students.
Submission
This assignment must be submitted electronically. This can be done
using the following commands (according to your enrollment):
- submit comp4300 ass1 ass1.c README[.pdf]
- submit comp6430 ass1 ass1.c README[.pdf]
Objectives
- Implement a distributed algorithm (and variations) using message passing.
- Model the computation and communication costs of the algorithm.
- Measure the performance of the implemented code, and compare against the model.
Description
Setup
The file ass1.zip contains templates for the files that you will submit.
It also contains an example test harness harness.c, and a Makefile which should be used to compile the programs.
Copy these files into a local directory before you begin work on your programs.
You can compile the system with the simple command make.
The test harness calls a number of functions in turn to perform distributed matrix multiplication. After each call, it compares the resulting matrix C with the result of a local DGEMM. The functions are:
- mult_replicated
- mult_replicated2
- mult_summa
- mult_summa_treebcast
Your task is to implement these functions.
You can compile the harness code using the Makefile provided.
On the NCI system, we will use the Intel Math Kernel Library implementation of the BLAS and LAPACK operations. On XE, load MKL as follows:
module load intel-mkl/10.3.0
The provided ass1.c contains a naïve implementation of matrix multiplication as follows:
for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
        C(i,j) = 0.0;
        for (l = 0; l < k; l++) {
            C(i,j) += A(i,l) * B(l,j);
        }
    }
}
Measure the performance of this code using MPI_Wtime. How does it compare to an identical multiplication performed using BLAS DGEMM?
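For illustration, a minimal timing sketch might look as follows. It assumes MPI has already been initialised by the harness and that MKL's CBLAS interface (cblas_dgemm) is used for the reference multiplication; the A/B/C macros below follow the column-major layout described above and may duplicate ones already defined in the template.

#include <stdio.h>
#include <mpi.h>
#include <mkl.h>    /* MKL's CBLAS interface: cblas_dgemm */

#define A(i,l) a[(i) + (l)*(m)]
#define B(l,j) b[(l) + (j)*(k)]
#define C(i,j) c[(i) + (j)*(m)]

/* Time the naive triple loop and dgemm for one m x n x k product
   (illustrative only; the real measurements belong in the harness). */
void compare_naive_dgemm(int m, int n, int k, double *a, double *b, double *c)
{
    int i, j, l;
    double t0, t1;
    double flops = 2.0 * m * n * k;   /* one multiply and one add per update */

    t0 = MPI_Wtime();
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++) {
            C(i,j) = 0.0;
            for (l = 0; l < k; l++)
                C(i,j) += A(i,l) * B(l,j);
        }
    t1 = MPI_Wtime();
    printf("naive: %8.3f s  %6.2f GFLOP/s\n", t1 - t0, flops / (t1 - t0) / 1e9);

    t0 = MPI_Wtime();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, a, m, b, k, 0.0, c, m);
    t1 = MPI_Wtime();
    printf("dgemm: %8.3f s  %6.2f GFLOP/s\n", t1 - t0, flops / (t1 - t0) / 1e9);
}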
Section 1: Replicated Data
- Implement two different formulations of distributed dense matrix multiply C = A × B using replicated data. Options include: broadcasting an entire matrix, sending entire rows or columns of a matrix to each process, an inner or outer product formulation, and so on.
Each process computes some portion of the C matrix.
Data must be moved from process 0 to each process so that a process holds all data it requires to compute its portion of C.
You must implement the following C functions:
int mult_replicated(int m, int n, int k, double* a, double* b, double* c);
int mult_replicated2(int m, int n, int k, double* a, double* b, double* c);
The test harness program calls these functions.
The inputs are:
- m, n, k: the matrix dimensions
- a: the matrix A at process 0, dimensions m × k, i.e. m rows by k columns in column-major format
- b: the matrix B at process 0, dimensions k × n
- c: the matrix C at process 0, dimensions m × n (uninitialised)
Your code should distribute data, evaluate part of the matrix multiplication at each process, and finally combine the results at process 0.
Multiplication of subblocks of the matrix may be performed by a call to the BLAS subroutine dgemm; a sketch of one possible scheme appears at the end of this section.
Your code does not need to support alpha or beta scaling or transposed matrices.
- Model the computation and communication time of both of your algorithms.
Show the costs separately for each phase of the computation. Is the algorithm communication or computation limited, or is there a transition point? (For example, the time to communicate a message of w words is commonly modelled as t = α + βw, where α is the latency and β the per-word transfer time.)
- Measure the performance of your implementation on XE.
Enhance the harness program to time a number of calls to mult_replicated and mult_replicated2. Address the following points:
- What is the performance on a single process?
- How does this compare to peak FLOPS and the performance of a single-process DGEMM?
- Increase the number of processes, up to a maximum of 32. Complete the following table for both of your algorithms, for multiplication of square N×N matrices (N is scaled so that the total work 2N^3 grows in proportion to the number of processes, keeping the work per process roughly constant):
Mean time for matrix multiplication with replicated data
| No. Processes | N | Mean Time (s) | GFLOP/s |
| 1 | 1000 | | |
| 2 | 1260 | | |
| 4 | 1587 | | |
| 8 | 2000 | | |
| 16 | 2520 | | |
| 32 | 3175 | | |
How do your algorithms scale with the number of processes? How does this compare with your cost models?
- Measure strong scaling up to 32 processes for a problem size of N=4000. How does this compare with your cost model?
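As referred to above, here is a minimal sketch of one possible replicated-data formulation, given only for illustration: it broadcasts all of A, scatters block columns of B (which are contiguous in column-major storage), multiplies locally with dgemm, and gathers the corresponding block columns of C back to process 0. It assumes MPI has been initialised by the harness and uses MKL's CBLAS interface; the function name is deliberately not one of the required ones, and your own formulations may (and for the second function must) differ.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <mkl.h>

/* Sketch: C = A * B with A broadcast and B split into block columns.
   Phases: (1) distribute, (2) local dgemm, (3) gather C at process 0. */
int mult_replicated_sketch(int m, int n, int k, double *a, double *b, double *c)
{
    int rank, nprocs, p, ncols;
    int *bcounts, *bdispls, *ccounts, *cdispls;
    double *a_all, *b_loc, *c_loc;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Column counts and displacements per process (n need not divide evenly) */
    bcounts = malloc(nprocs * sizeof(int));  bdispls = malloc(nprocs * sizeof(int));
    ccounts = malloc(nprocs * sizeof(int));  cdispls = malloc(nprocs * sizeof(int));
    if (!bcounts || !bdispls || !ccounts || !cdispls) { perror("malloc"); exit(1); }
    for (p = 0; p < nprocs; p++) {
        int cols = n / nprocs + (p < n % nprocs ? 1 : 0);
        bcounts[p] = cols * k;               /* elements of B sent to process p */
        ccounts[p] = cols * m;               /* elements of C returned by process p */
        bdispls[p] = (p == 0) ? 0 : bdispls[p-1] + bcounts[p-1];
        cdispls[p] = (p == 0) ? 0 : cdispls[p-1] + ccounts[p-1];
    }
    ncols = bcounts[rank] / k;               /* this process's share of columns */

    a_all = (rank == 0) ? a : malloc((size_t)m * k * sizeof(double));
    b_loc = malloc((size_t)k * (ncols ? ncols : 1) * sizeof(double));
    c_loc = malloc((size_t)m * (ncols ? ncols : 1) * sizeof(double));
    if (!a_all || !b_loc || !c_loc) { perror("malloc"); exit(1); }

    /* Phase 1: distribute the data */
    MPI_Bcast(a_all, m * k, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatterv(b, bcounts, bdispls, MPI_DOUBLE,
                 b_loc, bcounts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Phase 2: local computation, C_loc = A * B_loc */
    if (ncols > 0)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, ncols, k, 1.0, a_all, m, b_loc, k, 0.0, c_loc, m);

    /* Phase 3: collect the block columns of C at process 0 */
    MPI_Gatherv(c_loc, ccounts[rank], MPI_DOUBLE,
                c, ccounts, cdispls, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank != 0) free(a_all);
    free(b_loc); free(c_loc);
    free(bcounts); free(bdispls); free(ccounts); free(cdispls);
    return 0;
}

Each phase of such a scheme maps directly onto a term in the cost model: the broadcast and scatter give the communication terms, and the dgemm call gives the computation term.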
Section 2: SUMMA
- Use the provided code for the SUMMA algorithm to perform distributed matrix multiplication.
The code for SUMMA (the function pdgemm described in the paper) is provided in the file summa.c.
It assumes that each process holds a block of the A and B matrices, and computes a block of the C matrix.
You must implement the following C function:
int mult_summa(int m, int n, int k, double* a, double* b, double* c);
This should distribute data to the appropriate processes and then call pdgemm with the correct parameters. This requires that you determine the dimensions of the local blocks of A, B and C for each process, and create the row and column communicators required by SUMMA (a sketch of communicator creation appears at the end of this section). You may choose whatever default value you think is sensible for the panel size nb.
- Model the computation and communication time for the algorithm provided.
- Measure performance on XE as described in section 1. How does this compare with the cost model?
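As mentioned above, pdgemm needs row and column communicators for a logical process grid. A minimal sketch of how these might be created with MPI_Comm_split is shown below; the grid shape chosen here (the most square factorisation of the process count) and all of the names are illustrative assumptions, not part of the provided summa.c.

#include <math.h>
#include <mpi.h>

/* Sketch: build an nprow x npcol process grid (as square as possible) and
   the row/column communicators used by SUMMA. */
void make_grid(MPI_Comm comm, int *nprow, int *npcol, int *myrow, int *mycol,
               MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Choose the most square grid that exactly tiles nprocs */
    *nprow = (int)sqrt((double)nprocs);
    while (nprocs % *nprow != 0)
        (*nprow)--;
    *npcol = nprocs / *nprow;

    *myrow = rank / *npcol;          /* ranks placed in the grid in row-major order */
    *mycol = rank % *npcol;

    /* Processes sharing a grid row get one communicator; likewise for columns.
       The second argument groups processes; the third fixes their rank order. */
    MPI_Comm_split(comm, *myrow, *mycol, row_comm);
    MPI_Comm_split(comm, *mycol, *myrow, col_comm);
}

With the usual 2D block distribution, the local block of A at grid position (myrow, mycol) then has roughly m/nprow rows and k/npcol columns, and similarly for B and C; remember to handle dimensions that do not divide evenly.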
Section 3: SUMMA, non-pipelined
- Copy the code from pdgemm to pdgemm_nonpipelined and modify it so that it uses a tree broadcast (e.g. MPI_Bcast) instead of pipelined communications; the two communication patterns are contrasted in the sketch after this list.
- Does this change the cost model? If so, how?
- Measure performance on XE and compare against the provided code.
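As referred to above, here is a sketch of the two communication patterns. The names and details are assumptions and will not match the provided summa.c (in particular, the real pipelined code overlaps the forwarding of one panel with computation on the previous one); the sketch only contrasts point-to-point forwarding around a ring with a single collective broadcast.

#include <mpi.h>

/* Pipelined style: the panel travels around the row (or column) as a chain
   of point-to-point messages starting at its owner. */
void panel_ring(double *panel, int count, int owner, MPI_Comm comm)
{
    int me, np, left, right;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &np);
    left  = (me - 1 + np) % np;
    right = (me + 1) % np;

    if (me != owner)                 /* everyone except the owner receives once */
        MPI_Recv(panel, count, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);
    if (right != owner)              /* do not send the panel back to its owner */
        MPI_Send(panel, count, MPI_DOUBLE, right, 0, comm);
}

/* Tree style: one collective call; the MPI library typically uses a tree. */
void panel_tree(double *panel, int count, int owner, MPI_Comm comm)
{
    MPI_Bcast(panel, count, MPI_DOUBLE, owner, comm);
}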
Section 4 (COMP6430 students only): PUMMA
Read the original paper on PUMMA (Choi et al. 1994). Summarise and compare the two algorithms (PUMMA and SUMMA). Address the following points:
- How are the matrices distributed in each algorithm?
- What are the main differences in communication and costs?
- Does PUMMA have any advantages over SUMMA, in terms of applicability or performance in particular situations?
Requirements
Your code must be written using C and MPI, with calls to BLAS and LAPACK subroutines and system libraries.
Calls to other libraries are not permitted.
You are required to submit the following files:
- the file ass1.c, with function definitions completed by you.
- a plain text file called README or a PDF called README.pdf containing:
  - a disclaimer with your name and student number stating what (if any) contributions from others (not including course staff) are in your submission. Note that significant contributions may require a revision of your final mark, as this is intended to be an individual assignment;
  - the names of all component files and their purpose (note that intermediate or junk files should not be included in your submission);
  - your cost analysis of the algorithms as described above;
  - your performance measurements as described above;
  - any notable deficiencies, bugs, or issues with your work; and
  - (optional, not assessed) any feedback on what you found helpful in doing this work, or what from the course could be improved in helping you to carry out this work.
Your code should be written with good programming style (and so
should not produce any warnings when compiled with the -Wall
option!). It should be well commented, so as to enable the reader
(this includes the tutor/marker!) to understand how it works.
Identifiers should be meaningful and (unless their role is trivial)
commented where declared, any constants should be named rather than
hard-coded, and defensive programming should be used.
The latter includes checking the return status of any function call that can fail; on error, a message should be printed with a call to the system error function perror() and the process should exit with a non-zero status.
It also includes using assert() to perform other kinds of checks, and freeing any malloc'd data before (normal) exit (this permits the mcheck library to detect any buffer overwrites).
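For example (a sketch only; alloc_block is an illustrative helper, not part of the template):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate a rows x cols block of doubles in the defensive style described above. */
double *alloc_block(int rows, int cols)
{
    double *block;
    assert(rows > 0 && cols > 0);            /* sanity-check the arguments */
    block = malloc((size_t)rows * cols * sizeof(double));
    if (block == NULL) {                     /* check calls that can fail */
        perror("malloc of matrix block");
        exit(EXIT_FAILURE);                  /* non-zero exit status */
    }
    return block;                            /* caller must free() before normal exit */
}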
When running jobs on the XE system, please use the normal queue rather than the express queue, as the express queue is charged at 3x the standard usage units. Please be mindful that you share the allocation with all other students in this course!
Marking Scheme
| Section | Implementation | Model | Measurement | Total |
| Replicated data | 10 | 5 | 5 | 20 |
| SUMMA | 6 | 3 | 3 | 12 |
| SUMMA with tree broadcast | 4 | 2 | 2 | 8 |
| Total (COMP4300) | 20 | 10 | 10 | 40 |
| Section 4 report (COMP6430 only) | | | | +10 |
| Total (COMP6430) | | | | 50 |
Extensions
See the course Administration page.
Late penalties are as follows:
| How Late | Penalty from 40 Marks |
| less than 1 hour | -1 |
| between 1 and 6 hours | -2 |
| between 6 and 24 hours (1 day) | -5 |
| between 24 and 48 hours (2 days) | -10 |
| between 48 and 72 hours (3 days) | -20 |
| more than 72 hours (4 days) | -40 (forget it!) |
Note: late penalties apply even if you submit just one file after the
deadline.