COMP4300/6430 2013

Assignment 1 sample solution

Setup (not marked)

The naive implementation is significantly slower than MKL’s DGEMM implementation. MKL is highly optimised to use vector instructions and blocking for registers and cache.

The number of FLOPs required for matrix multiplication (both naive and DGEMM) is approximately 2mnk. Therefore we calculate GFLOP/s as 2mnk / (1.0e9 × time). Peak FLOP/s for a single core of XE is 12.0 GFLOP/s (3.0 GHz × 4 FLOPs per cycle). DGEMM gets quite close to this for matrices larger than about 200×200 - see Figure 1. In comparison, the naive implementation achieves less than 1/8 of peak FLOP/s for small matrices, and performs even worse for large matrices due to cache effects.
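For reference, the GFLOP/s figures can be obtained by timing a single DGEMM call as sketched below. This is an illustrative sketch only (assuming MKL's C interface and square N×N matrices), not the marked benchmark code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <mkl.h>   /* cblas_dgemm */

    /* Time one N x N x N DGEMM call and report GFLOP/s as 2*m*n*k / (1e9 * time). */
    int main(void)
    {
        const int n = 1000;                       /* assumed square matrices */
        double *A = malloc(n * n * sizeof(double));
        double *B = malloc(n * n * sizeof(double));
        double *C = malloc(n * n * sizeof(double));
        for (int i = 0; i < n * n; i++) {
            A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gflops = 2.0 * n * (double)n * n / (1.0e9 * secs);
        printf("N=%d  time=%.3f s  %.2f GFLOP/s\n", n, secs, gflops);

        free(A); free(B); free(C);
        return 0;
    }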

Figure 1: Performance of Naive vs. Intel MKL DGEMM on XE. (a) Matrix multiplication time; (b) matrix multiplication floating point performance.

Note on MPI error handling
The default error handler for MPI programs is MPI_ERRORS_ARE_FATAL, which means that any MPI error will abort the entire program. It is therefore acceptable for user programs to call MPI functions without checking a return code after each call. See MPI 2.2, Error Handling.
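If explicit error checking is wanted instead, the error handler can be changed so that MPI calls return error codes to the caller. A minimal sketch (not required for the assignment) is:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Replace the default MPI_ERRORS_ARE_FATAL handler so that errors are
           returned to the caller instead of aborting the program. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank;
        int err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (err != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "MPI error: %s\n", msg);
            MPI_Abort(MPI_COMM_WORLD, err);
        }

        MPI_Finalize();
        return 0;
    }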

Section 1: Replicated Data

  1. See mult_replicated and mult_replicated2 in attached ass1.c.
    (Note: these are only two possible solutions - there are many other possibilities.)

  2. mult_replicated divides p processes into a P×Q grid. Process 0 sends an (m/P)×k row panel of A and a k×(n/Q) column panel of B to each process in turn. Each process calculates the corresponding (m/P)×(n/Q) block of matrix C. Each process then sends its block of C to process 0.
    Assuming square matrices and a square grid of processes, the execution times for each phase of this algorithm are approximately:

Computation and communication are not overlapped, so the total time is the sum of both. Given that, on XE, t_startup > t_data > t_flop, we would expect communication to dominate for small matrix sizes. Communication is O(p + √p·N²) and computation is O(N³/p), so the algorithm is computation-limited for large matrix sizes. As communication time increases with both the problem size and the square root of the number of processes, weak scaling (increasing the problem size with the number of processes so that each process has a constant amount of computation) will be increasingly dominated by communication.

mult_replicated2 broadcasts A to all p processes (using MPI_Bcast) and sends a k×(n/p) column panel of B to each process in turn. Each process calculates the corresponding m×(n/p) column panel of matrix C. Each process then sends its column panel of C to process 0 (a simplified sketch of this pattern is given below, after the cost discussion).
The execution times for each phase of this algorithm are approximately:

assuming square matrices. Similar considerations apply as for mult_replicated above, except that, as communication is O(log(p)·N²) rather than O(√p·N²), we would expect the communication time to increase more slowly with weak scaling.
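As a concrete illustration of the broadcast-based pattern, the sketch below is a simplified, hypothetical version of mult_replicated2: it assumes square N×N matrices stored in column-major order with N divisible by p, and uses MPI_Scatter/MPI_Gather collectives in place of the individual sends and receives described above. It is not the code from ass1.c.

    #include <stdlib.h>
    #include <mpi.h>
    #include <mkl.h>

    /* Hypothetical sketch: C = A*B with A broadcast and B split into column panels.
       A, B, C must be allocated (N*N doubles) on every rank; only rank 0's B and C
       contents are significant.  Assumes column-major storage and p divides N. */
    void mult_replicated2_sketch(double *A, double *B, double *C, int N)
    {
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        int ncols = N / p;                        /* columns of B and C per process */

        double *Bloc = malloc(N * ncols * sizeof(double));
        double *Cloc = malloc(N * ncols * sizeof(double));

        /* Every process needs all of A; column panels of B are contiguous
           in column-major storage, so a simple scatter suffices. */
        MPI_Bcast(A, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Scatter(B, N * ncols, MPI_DOUBLE, Bloc, N * ncols, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);

        /* Local N x ncols panel of C = A (N x N) * Bloc (N x ncols). */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    N, ncols, N, 1.0, A, N, Bloc, N, 0.0, Cloc, N);

        /* Column panels of C are also contiguous, so gather them on process 0. */
        MPI_Gather(Cloc, N * ncols, MPI_DOUBLE, C, N * ncols, MPI_DOUBLE,
                   0, MPI_COMM_WORLD);

        free(Bloc);
        free(Cloc);
    }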

  1. Performance on a single process is less than for DGEMM alone: for square matrices of size N=1000, mult_replicated and mult_replicated2 achieve only 9.40 and 9.98 GFLOP/s respectively, compared to 10.8 GFLOP/s for DGEMM. This is due to the extra time taken to allocate memory and move data between the provided matrices A, B, C and the temporary storage for these matrices.
    Peak FLOP/s on XE is 12.0 GFLOP/s - neither algorithm gets as close to this figure as DGEMM does. It is very rare for any algorithm to achieve peak FLOP/s, as this requires an exact match between the instructions issued (load, store, multiply, add) and the capabilities of the architecture, as well as the hiding of all memory latencies.

    Weak scaling

    Table 1: weak scaling for matrix multiplication with mult_replicated
    No. Processes | N    | Mean Time (s) | GFLOP/s | GFLOP/s per node
    1             | 1000 | 0.213         | 9.40    | 9.40
    2             | 1260 | 0.250         | 16.0    | 8.00
    4             | 1587 | 0.298         | 26.8    | 6.70
    8             | 2000 | 0.520         | 30.8    | 3.84
    16            | 2520 | 0.720         | 44.5    | 2.78
    32            | 3175 | 1.44          | 44.3    | 1.38

    Table 2: weak scaling for matrix multiplication with mult_replicated2
    No. Processes | N    | Mean Time (s) | GFLOP/s | GFLOP/s per node
    1             | 1000 | 0.200         | 9.98    | 9.98
    2             | 1260 | 0.239         | 16.8    | 8.40
    4             | 1587 | 0.278         | 28.7    | 7.18
    8             | 2000 | 0.398         | 40.3    | 5.03
    16            | 2520 | 0.532         | 60.2    | 3.76
    32            | 3175 | 0.760         | 84.2    | 2.63

    mult_replicated scales poorly with the number of processes - while total GFLOP/s increases up to 16 processes, GFLOP/s per node drops rapidly. As predicted, weak scaling becomes dominated by communication time, with 32 processes actually achieving lower total GFLOP/s than 16 processes.
    mult_replicated2 scales better, continuing to increase total GFLOP/s up to 32 processes. Efficiency still drops, although less rapidly than for mult_replicated, as predicted.
  2. Strong scaling

    Table 3: strong scaling for matrix multiplication with mult_replicated
    No. Processes | N    | Mean Time (s) | GFLOP/s | GFLOP/s per node
    1             | 4096 | 12.5          | 11.0    | 11.0
    2             | 4096 | 6.96          | 19.7    | 9.85
    4             | 4096 | 4.01          | 34.2    | 8.55
    8             | 4096 | 2.82          | 48.6    | 6.08
    16            | 4096 | 2.16          | 63.6    | 3.98
    32            | 4096 | 2.42          | 56.8    | 1.78

    Strong scaling is simpler to compare with the cost model as N is fixed; mult_replicated then scales as O(√p + 1/p).
    This suggests a crossover point, a value of p for which increasing communication time outweighs the 1/p reduction in computation. After this point, adding further processes will increase the total time. We see such a crossover point in the measured timings between 16 and 32 processes.

    Table 4: strong scaling for matrix multiplication with mult_replicated2
    No. Processes | N    | Mean Time (s) | GFLOP/s | GFLOP/s per node
    1             | 4096 | 12.3          | 11.2    | 11.2
    2             | 4096 | 6.48          | 21.2    | 10.6
    4             | 4096 | 3.76          | 36.6    | 9.14
    8             | 4096 | 2.45          | 56.1    | 7.01
    16            | 4096 | 1.72          | 80.1    | 5.01
    32            | 4096 | 1.32          | 104     | 3.25

    The communication time for mult_replicated2 grows more slowly with the number of processes; the total time scales as O(log(p) + 1/p). This also suggests a crossover point, but at a higher value of p than for mult_replicated. We do not observe any crossover in the measurements: the total time continues to decrease up to p=32. A symbolic sketch of where such a crossover falls is given below.
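As a rough sketch of the crossover argument (a and b are illustrative constants, not measured values), ignore the startup term and model the strong-scaling time of mult_replicated as

    T(p) ≈ a·√p + b/p.

Setting dT/dp = a/(2√p) - b/p² = 0 gives a minimum at p* = (2b/a)^(2/3): below p*, adding processes reduces the total time; above it, the growing communication term dominates and the total time increases again.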

Section 2: SUMMA

  1. See mult_summa in attached ass1.c.

  2. From the SUMMA paper, the execution times for the algorithm are approximately:

    In addition, for our code, communications are required to distribute matrices A and B and gather C at the end.

  3. Measured times for SUMMA are as follows:

Note: times from here on are from the Vayu system

Table 5: weak scaling for matrix multiplication with mult_summa
No. Processes | N    | Mean Time (s) | GFLOP/s | GFLOP/s per node
1             | 1000 | 0.189         | 10.6    | 10.6
2             | 1260 | 0.215         | 18.6    | 9.32
4             | 1587 | 0.228         | 35.1    | 8.77
8             | 2000 | 0.251         | 63.7    | 7.97
16            | 2520 | 0.323         | 99.0    | 6.18
32            | 3175 | 0.444         | 144     | 4.50

Table 6: strong scaling for matrix multiplication with mult_summa
No. Processes | N    | Mean Time (s) | GFLOP/s | GFLOP/s per node
1             | 4096 | 12.0          | 11.5    | 11.5
2             | 4096 | 6.23          | 22.1    | 11.1
4             | 4096 | 3.29          | 41.8    | 10.5
8             | 4096 | 1.82          | 75.6    | 9.45
16            | 4096 | 1.09          | 127     | 7.91
32            | 4096 | 0.767         | 179     | 5.60

Weak scaling is worse than strong scaling - which is surprising, as SUMMA is meant to exhibit linear weak scaling! The reason is that the problem size is small, so execution time is dominated by the O(p) scattering and gathering of data, which is not included in the SUMMA algorithm itself.
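For reference, the outer loop of SUMMA has roughly the shape sketched below. This is a simplified, hypothetical sketch rather than the code in ass1.c: it assumes a square q×q process grid, square column-major local blocks of size nloc×nloc already distributed across the grid, a panel width nb that divides nloc, and row/column communicators built so that a process's rank in row_comm is its grid column and its rank in col_comm is its grid row. For clarity it broadcasts panels with MPI_Bcast, i.e. the non-pipelined variant of Section 3; the pipelined version replaces each MPI_Bcast with a ring of sends and receives along the process row or column.

    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>
    #include <mkl.h>

    /* Simplified SUMMA sketch: Cloc += Aloc * Bloc accumulated over panel steps.
       Aloc, Bloc, Cloc are the local nloc x nloc blocks (column-major) of the
       distributed matrices on a q x q grid.  row_comm/col_comm are assumed to be
       created (e.g. with MPI_Comm_split) so that rank in row_comm equals mycol
       and rank in col_comm equals myrow. */
    void summa_sketch(const double *Aloc, const double *Bloc, double *Cloc,
                      int nloc, int nb, int myrow, int mycol,
                      MPI_Comm row_comm, MPI_Comm col_comm)
    {
        int q;
        MPI_Comm_size(row_comm, &q);                  /* grid dimension */
        int panels_per_block = nloc / nb;             /* nb assumed to divide nloc */
        int nsteps = q * panels_per_block;            /* panels over the global k dimension */

        double *Apanel = malloc(nloc * nb * sizeof(double));
        double *Bpanel = malloc(nb * nloc * sizeof(double));

        for (int step = 0; step < nsteps; step++) {
            int owner = step / panels_per_block;      /* grid column owning this panel of A
                                                         (= grid row owning the panel of B) */
            int off = (step % panels_per_block) * nb; /* offset inside the owner's local block */

            /* Column panel of A is broadcast along the process row ... */
            if (mycol == owner)
                memcpy(Apanel, Aloc + (size_t)off * nloc, (size_t)nloc * nb * sizeof(double));
            MPI_Bcast(Apanel, nloc * nb, MPI_DOUBLE, owner, row_comm);

            /* ... and the matching row panel of B along the process column. */
            if (myrow == owner)
                for (int j = 0; j < nloc; j++)
                    memcpy(Bpanel + (size_t)j * nb, Bloc + (size_t)j * nloc + off,
                           nb * sizeof(double));
            MPI_Bcast(Bpanel, nb * nloc, MPI_DOUBLE, owner, col_comm);

            /* Local rank-nb update: Cloc += Apanel (nloc x nb) * Bpanel (nb x nloc). */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        nloc, nloc, nb, 1.0, Apanel, nloc, Bpanel, nb, 1.0, Cloc, nloc);
        }
        free(Apanel);
        free(Bpanel);
    }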

Section 3: SUMMA, non-pipelined

  1. See mult_summa in attached ass1.c. Note the use of the function do_summa with a function parameter to implement both the pipelined and non-pipelined versions of SUMMA (a sketch of this kind of structure is given at the end of this section).

  2. Using a tree broadcast instead of the pipelined ring broadcast does change the cost model, multiplying SUMMA communications by a factor of log(p).

  3. Measured times for non-pipelined SUMMA are as follows:

Table 7: weak scaling for matrix multiplication with mult_summa_treebcast
No. Processes | N    | Mean Time (s) | GFLOP/s | GFLOP/s per node
1             | 1000 | 0.186         | 10.8    | 10.8
2             | 1260 | 0.195         | 20.5    | 10.2
4             | 1587 | 0.227         | 35.2    | 8.80
8             | 2000 | 0.259         | 61.9    | 7.73
16            | 2520 | 0.341         | 93.9    | 5.87
32            | 3175 | 0.503         | 127     | 3.97

Table 8: strong scaling for matrix multiplication with mult_summa_treebcast
No. Processes | N    | Mean Time (s) | GFLOP/s | GFLOP/s per node
1             | 4096 | 12.0          | 11.5    | 11.5
2             | 4096 | 5.90          | 23.3    | 11.6
4             | 4096 | 3.29          | 41.8    | 10.4
8             | 4096 | 1.83          | 75.1    | 9.38
16            | 4096 | 1.14          | 120     | 7.52
32            | 4096 | 0.859         | 160     | 5.00

Performance is initially better than mult_summa for both the weak and strong scaling tests, probably due to the efficiency of calling MPI_Bcast compared to a series of blocking MPI_Send calls. However, the additional factor of log(p) soon takes over, and the tree broadcast is less efficient for 8 or more processes.
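One way the shared do_summa structure mentioned above can be organised is to pass the panel broadcast as a function parameter. The sketch below is hypothetical (the type and function names are illustrative, not necessarily those in ass1.c):

    #include <mpi.h>

    /* Hypothetical signature: broadcast 'count' doubles in 'buf' from 'root'
       to every process in 'comm'. */
    typedef void (*panel_bcast_t)(double *buf, int count, int root, MPI_Comm comm);

    /* Non-pipelined variant: a single tree broadcast. */
    static void bcast_tree(double *buf, int count, int root, MPI_Comm comm)
    {
        MPI_Bcast(buf, count, MPI_DOUBLE, root, comm);
    }

    /* Pipelined ring variant: pass the panel around the ring of processes,
       so successive broadcasts can overlap with computation. */
    static void bcast_ring(double *buf, int count, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        if (size == 1)
            return;
        int prev = (rank - 1 + size) % size;
        int next = (rank + 1) % size;

        if (rank != root)
            MPI_Recv(buf, count, MPI_DOUBLE, prev, 0, comm, MPI_STATUS_IGNORE);
        if (next != root)
            MPI_Send(buf, count, MPI_DOUBLE, next, 0, comm);
    }

    /* do_summa(..., bcast_tree) and do_summa(..., bcast_ring) would then share
       the same loop structure, differing only in how panels are broadcast. */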

Section 4: (COMP6430 students only) PUMMA

PUMMA uses a block-scattered (also called block-cyclic) distribution, in which a matrix is decomposed into blocks of size r × s. SUMMA uses a block distribution, which is a special case of the block-scattered distribution for which r=m/P and s=n/Q.
PUMMA uses standard tree broadcast for blocks of the matrix B, while rolling blocks of A between processes in a row, similar to the ring broadcast in the SUMMA algorithm. PUMMA requires more workspace (memory) per node. The block-scattered distribution may be advantageous for applications that use matrix multiplication with other linear algebra operations, as it provides better load balance for some operations, for example LU factorization. While it is possible to redistribute matrices between operations to use the best distribution for each operation, it is preferable to avoid the overhead of redistribution. PUMMA avoids the need for redistribution by using a decomposition that provides a good load balance for many linear algebra operations.
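To make the block-scattered (block-cyclic) distribution concrete, the hypothetical sketch below maps a block of a global matrix to its owning process in a P×Q grid for block size r×s; choosing r=m/P and s=n/Q yields exactly one block per process, i.e. the plain block distribution used by SUMMA.

    #include <stdio.h>

    /* Owner of global element (i, j) under an r x s block-cyclic distribution
       over a P x Q process grid: block indices are dealt out cyclically. */
    static void owner(int i, int j, int r, int s, int P, int Q, int *prow, int *pcol)
    {
        *prow = (i / r) % P;    /* block row i/r goes to process row (i/r) mod P */
        *pcol = (j / s) % Q;    /* block column j/s goes to process column (j/s) mod Q */
    }

    int main(void)
    {
        /* Example: 8x8 matrix, 2x2 blocks, 2x2 process grid. */
        int P = 2, Q = 2, r = 2, s = 2;
        for (int i = 0; i < 8; i += r) {
            for (int j = 0; j < 8; j += s) {
                int prow, pcol;
                owner(i, j, r, s, P, Q, &prow, &pcol);
                printf("block (%d,%d) -> process (%d,%d)\n", i / r, j / s, prow, pcol);
            }
        }
        return 0;
    }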