The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science

COMP8320 Multicore Computing Assignment 3, 2011

RCCE LINPACK Benchmark Components on the Intel SCC Cloud Computer


Deadline: 17:00 on Friday 04 November


(Please report any errors, omissions and ambiguities to the course lecturer)
(amendments to this document after its release will be marked in blue).


This assignment is worth 10% of your total course mark. It will be marked out of 40.

Submission

This assignment must be submitted electronically. This can be done using the submit comp8320 ass3 ... command (from the student system). The file names which you should use (and the only ones that will be accepted by the submit command) are:

    report.pdf
    matmult.c
    paridamax.c

Extensions

See http://cs.anu.edu.au/student/comp8320/assessment.php. Late penalties are as follows:

    How late                            Penalty (from 60 marks)
    less than 1 hour                    -1
    between 1 and 6 hours               -2
    between 6 and 24 hours (1 day)      -4
    between 24 and 48 hours (2 days)    -8
    between 48 and 72 hours (3 days)    -16
    between 72 and 96 hours (4 days)    -32
    more than 96 hours                  -64 (forget it!)

Note: late penalties apply even if you submit just one file after the deadline.

Plagiarism

This is an individual assignment. Any significant assistance received from anyone except the course staff must be acknowledged at the top of report.pdf. See http://cs.anu.edu.au/student/comp8320/assessment.php.

Setup

Copy the files (from the CS student system) for the assignment into your home directory area:
    cp -r /dept/dcs/comp8320/public/ass3 .

The provided test programs matmultest.c and idamax.c test an RCCE distributed matrix multiply and an index-of-absolute-maximum computation, respectively. They are built using the command make.

The general usage for the idamax program is:

    rccerun_h -nue p idamax N
which computes the index of the absolute maximum of a vector of length N over SCC cores (UEs) numbered 0 to p-1. The index and value of the maximum must be reported by all UEs. The elements of the vector are spread over the UEs by the block distribution; the value of global element i is the same regardless of p. Thus, the parallel implementation of the algorithm can be checked for correctness against the result for p=1 for the same N.

In the block distribution, process (UE in RCCE terminology) i, where 0 ≤ i < p holds the following elements of a vector of length N:

    i*(N/p), i*(N/p) + 1, ..., i*(N/p) + (N/p)-1
The above is exact when p | N; the general formula is
    i0, i0 + 1, ..., i0 + n-1
where i0 = L2G0(N, i, p) and n = G2L(N, i, p) (see distlinaux.h for a definition of these functions).
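
For illustration only, a minimal sketch of how such helpers are often defined follows. These are not the definitions from distlinaux.h (which are authoritative); the convention below, in which the first N mod p UEs hold one extra element, is an assumption, chosen to agree with the exact formula above when p | N:

    /* Sketch of block-distribution helpers -- NOT the distlinaux.h
     * definitions, just one common convention. */

    /* number of elements held locally by UE i (the role of G2L above) */
    static int blk_len(int N, int i, int p) {
        return N / p + (i < N % p ? 1 : 0);
    }

    /* global index of UE i's first local element (the role of L2G0 above) */
    static int blk_first(int N, int i, int p) {
        int q = N / p, r = N % p;
        return i * q + (i < r ? i : r);
    }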

The usage for the matmultest program is:

    rccerun_h -nue p matmultest [-p] [-v v] [-x px] [-r r] [-h] M N K
This program performs r repetitions of a distributed matrix multiply C += AB, where C is an M by N matrix distributed using the block distribution over a py by px process grid. A is an M by K matrix, whose rows are distributed using the block distribution over process column 0. B is a K by N matrix, whose columns are distributed using the block distribution over process row 0. The following diagram illustrates this over a 4 by 4 process grid, indicating two-dimensional process co-ordinates (idy,idx):

    [Diagram: placement of A, B and C on a 4 by 4 process grid, with UEs labelled by (idy,idx); not reproduced in this text version.]
-v 0 uses the RCCE broadcast to broadcast A (B) across the rows (columns) of the process grid; -v 1 uses a pipelined broadcast, and -v 2 uses a binary-tree based broadcast instead. Refer to Laboratory Exercise 8 for descriptions and partial implementations of these algorithms. -v 3 uses a pipelined broadcast for A and a tree broadcast for B. The High Performance Linpack Benchmark, a widely-used distributed version of the LINPACK Benchmark, typically uses this combination for its matrix multiply stage.
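
To illustrate the idea behind -v 1, the following is a minimal sketch of a pipelined broadcast from UE 0 along the chain 0 -> 1 -> ... -> p-1, using blocking RCCE send/receive. It is an illustration under assumptions, not the Laboratory 8 code: the chunk size is arbitrary, and any cache-line padding RCCE may require of message sizes is glossed over.

    /* Pipelined broadcast sketch: splitting the buffer into chunks
     * lets every link in the chain carry a (different) chunk at the
     * same time once the pipeline has filled. */
    #include <stddef.h>
    #include "RCCE.h"

    #define CHUNK 1024                       /* bytes per pipeline stage */

    void pipe_bcast(char *buf, size_t len) {
        int me = RCCE_ue(), p = RCCE_num_ues();
        size_t off;
        for (off = 0; off < len; off += CHUNK) {
            size_t n = (len - off < CHUNK) ? len - off : CHUNK;
            if (me > 0)                      /* take chunk from predecessor */
                RCCE_recv(buf + off, n, me - 1);
            if (me < p - 1)                  /* pass chunk on to successor */
                RCCE_send(buf + off, n, me + 1);
        }
    }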

The -h option generates the matrices on UE 0, scatters the respective portions to the other UEs before the multiply, and gathers the updated C matrix back to UE 0 for checking. The speed of the scatter/gather operations gives useful measurements of a single communication link in the SCC intra-chip network. However, such a model is not desirable for distributed memory algorithms, and without this option, the matrices are generated and checked in a distributed fashion.

Consider now a distributed version of LINPACK, using a two-dimensional block distribution. Assuming the lower (upper) panels are always within a single process column (row) -- the blocking factor w can always be adjusted to ensure this -- the implied data movement in the algorithm is:

  • find P[i]: a reduction over the process column holding Li, combined with a broadcast down the column (and later across the rows).
  • row swap within Li: two message exchanges.
  • rank-1 update: broadcast ui down the process column holding Li.
  • row swaps outside the lower panel: 2w message exchanges.
  • matrix multiply: broadcast Li across process rows, and broadcast Ui down process columns.
It can be seen that the configuration in matmultest above mimics the matrix multiply in LINPACK. In the above algorithm, all horizontal communication (P[i] and Li) flows rightwards. For this reason, a pipelined broadcast can be utilized, having an effectively lower overhead than the binary-tree based broadcast (depending on the nature of the underlying communication network).
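
To see why, consider a simple linear cost model (an assumption, not from the course notes): if sending m bytes over one link costs t_0 + m*t_b, then broadcasting m bytes in c chunks over p processes costs roughly

    T_pipe ≈ (p + c - 2) * (t_0 + (m/c) * t_b)
    T_tree ≈ ceil(log_2 p) * (t_0 + m * t_b)

For large messages and a sensible choice of c, T_pipe approaches m*t_b, the cost of sending the message across a single link just once, whereas the tree pays the full m*t_b in each of its ceil(log_2 p) rounds; the tree wins mainly when messages are short and the per-message overhead t_0 dominates.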

Accessing the SCC

Once given your password, log in to the SCC's front-end using ssh jolokia. Append to the bottom of your ~/.bashrc the line:
    source ~peter/comp8320/login/Bashrc
Append to the bottom of your ~/.profile the line:
    source ~/.bashrc
Once thoroughly debugged under the RCCE emulator, scp your ass3 sub-directory over to jolokia (ensure that the contents of this directory are not readable by other users). From that directory, the make command will copy the SCC executable into a directory called /shared/$USER, which is in the mount area of the SCC cores. (Only) from /shared/$USER will you be able to run the rccerun_h command, as before. Note that in this directory, your test programs will be renamed idamax_$USER and matmultest_$USER; rccerun_h on jolokia will only accept programs with the suffix _$USER.

For technical difficulties with the SCC, contact comp8320@cs.

Please consider the use of the SCC a privilege, and be mindful that it is a resource shared by researchers as well as fellow students. The SCC can be considered a `device' in the same sense as a GPU, in that only one user at a time can (safely) use it (the provided rccerun_h command attempts to ensure this, but it is not foolproof). Please observe the following rules:

  • use the SCC only for performance measurements, and not debugging.
  • access it only via the rccerun_h command; do not under any circumstances try to log into the cores.
  • minimize the run-time of your jobs (e.g. keep under 60s - rccerun_h will enforce this anyway!).
Finally, note that the availability of the SCC cannot be guaranteed to the extent of other facilities that you have used so far. Therefore it is a bad strategy to leave your work to the last moment!

Your Tasks

  1. Implement in paridamax.c an algorithm to find the index (and signed value) of the absolute maximum element of a vector distributed across a number of processes using the block distribution. You may choose your method, but the more efficient the better; justify your strategy in your report. Note that while RCCE has some built-in reductions, it is not obvious how they can be adapted to this problem. Test the code thoroughly under the RCCE emulator. A naive baseline is sketched below; you are expected to improve on it.
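
    As a point of departure only, the following sketches the naive approach: each UE finds its local champion, then UE 0 linearly collects one (value, global index) candidate per UE and sends the winner back. The RCCE call usage and the padded message layout are assumptions for illustration, and its O(p) communication steps are exactly what a good submission should improve on (e.g. with a tree pattern costing O(log p) steps):

      /* Naive baseline: linear reduction to UE 0, linear broadcast back.
       * Message layout (a double and an int, padded) is illustrative only. */
      #include <math.h>
      #include "RCCE.h"

      typedef struct { double val; int idx; char pad[20]; } MaxMsg;

      /* x: n local elements, with global indices i0 .. i0+n-1 */
      void par_idamax(double *x, int n, int i0, int *gidx, double *gval) {
          int me = RCCE_ue(), p = RCCE_num_ues(), k, ue;
          MaxMsg m = { 0.0, -1, {0} };
          for (k = 0; k < n; k++)              /* local search */
              if (m.idx < 0 || fabs(x[k]) > fabs(m.val)) { m.val = x[k]; m.idx = i0 + k; }
          if (me == 0) {
              MaxMsg r;
              for (ue = 1; ue < p; ue++) {     /* collect one candidate per UE */
                  RCCE_recv((char *)&r, sizeof r, ue);
                  if (fabs(r.val) > fabs(m.val)) m = r;
              }
              for (ue = 1; ue < p; ue++)       /* announce the winner */
                  RCCE_send((char *)&m, sizeof m, ue);
          } else {
              RCCE_send((char *)&m, sizeof m, 0);
              RCCE_recv((char *)&m, sizeof m, 0);
          }
          *gidx = m.idx; *gval = m.val;        /* signed value, global index */
      }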

  2. Implement in matmult.c the versions of distributed matrix multiply, as specified in matmult.h (see also above description of matmultest). Again test and debug your code thoroughly under the RCCE emulator before proceeding.

  3. In your report, write (pseudo-) code for a function to perform the row-swap operation to the left of the lower panel, e.g.
      A(i, 0:j-1) <-> A(pi, 0:j-1)
    Assume the function signature is as follows:
      RowSwapLeft(int j, int i, int pi, double* A, int ldA)
    where A is the pointer to the local storage for the matrix in the current process. Assume that the grid and process co-ordinates (px, py, idx and idy) are available as global variables, as is the matrix dimension N. Macros etc. from distlinaux.h may be used.

    Also describe how you would modify your code to perform the swaps to the right of the panel (A(i, j+w:n-1) <-> A(pi, j+w:n-1)). A sketch of the underlying exchange primitive follows.
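
    The core building block for either swap is a deadlock-free exchange of two equal-sized row segments between two UEs. A minimal sketch (the tie-breaking rule and buffer handling are illustrative assumptions, not a prescribed solution):

      /* Exchange buf (len bytes) with UE `other', in place.  With
       * blocking send/recv, both sides sending first would deadlock;
       * ordering the calls by UE number avoids this. */
      #include <stdlib.h>
      #include <string.h>
      #include "RCCE.h"

      void swap_with(char *buf, size_t len, int other) {
          int me = RCCE_ue();
          if (me < other) {
              RCCE_send(buf, len, other);     /* our data is out... */
              RCCE_recv(buf, len, other);     /* ...before we overwrite it */
          } else if (me > other) {
              char *tmp = malloc(len);        /* keep our copy until sent */
              memcpy(tmp, buf, len);
              RCCE_recv(buf, len, other);
              RCCE_send(tmp, len, other);
              free(tmp);
          }
          /* me == other: both rows on this UE -- swap locally instead */
      }

    Note that if rows i and pi fall on the same UE, no messages are needed at all.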

  4. For a largish value of N, say 2000, determine the scaling behavior of your absolute maximum function as the number of UEs grows (use powers of 2 for p). Tabulate your results. What patterns do you observe? Try to explain them.

  5. Determine the value of K that is `near-optimal' on a single SCC core for a square matrix multiply (using a moderately large value of N). Take `near-optimal' to mean the smallest K that comes within 5% of the asymptotic performance (which you may approximate by the performance at a large K, say K=100). Tabulate the results of your experiments. Use this value of K (unless otherwise specified) in subsequent experiments. The performance metric is sketched below.
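
    For this and the later measurements, the conventional metric (an assumption here, but the standard one) counts two floating-point operations per multiply-add in C += AB, so over r repetitions:

      MFLOP/s = 2 * M * N * K * r / (time in seconds * 10^6)

    For the square multiplies in this question, that is 2 * N * N * K * r flops in total.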

  6. Effect of repeated computations. With a 1 by 8 process grid (i.e. px=8), choose a suitable value of N and compare the performance of the 3 versions (v=0,1,2) of square matrix multiply for both r=1 and r=10. Try to explain the difference in performance for the 2 values of r in each case. If possible, devise further experiments to test your conjectures.

  7. Effect of process grid aspect ratio. Choose a suitable value of p that can be factorized in a number of ways, and a suitable value of N. Explore the effect of the aspect ratio on square matrix multiply, ranging over 1 by p, 2 by p/2, ..., p by 1, for the 3 versions. You may constrain the number of repetitions to r = 2*px (this simulates the situation in the LINPACK computation) - note that you can do this nicely in bash, as in the following example:
      px=4; rccerun_h -nue p matmultest -v 2 -x $px -r $((2*px)) N N K
    Tabulate your results and explain the differences; in particular, which aspect ratio is optimal, and why? Hint: consider the amount of data flowing across individual network links.

  8. Scaling of matrix multiply. For the optimal (or near enough) aspect ratio, determine the scalability of the matrix multiply (using the same values of N and K as in the previous question, and choosing your favorite value of v). That is, obtain results from 1 core up to the maximum number for that ratio (roughly similar aspect ratios may be used for intermediate points), and tabulate your results.

    Calculate the speedups (performance in MFLOP/s over that on a single core) -- this is called strong scaling. Do likewise for weak scaling: use the value N = Nm * px / pxm, where pxm is the value of px for your largest grid, and Nm refers to the value of N in the previous paragraph. The results should be quite different - try to explain them. Both metrics are summarized below.
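
    For reference, with R(p) the measured MFLOP/s rate on p cores (each R measured at the N appropriate to that p), the speedup in both cases is

      S(p) = R(p) / R(1)

    The difference lies in the problem size: strong scaling keeps N fixed as p grows, while weak scaling grows N with the grid. Note that since the multiply costs 2*N*N*K flops and, for a fixed aspect ratio, p grows as px^2, the choice N = Nm * px / pxm keeps the flops per core roughly constant -- which is the point of weak scaling.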

  9. Estimate the FLOP rate that the LINPACK Benchmark would have on its first `outer iteration', with the N and optimal aspect ratio above and using NB=K. The rate can be taken as the total number of FLOPs for the iteration (an exercise in simple maths) divided by the time the iteration takes. The latter can be broken down into the times for the component calculations:

      • the matrix multiply, which you have measured;
      • determining the pivots (absolute maximum), which you can obtain from experiments such as those of Q4 (noting this is repeated over NB columns);
      • the rank-1 updates, for which you can use a suitable matmultest experiment (suggestion: use the average value for the width, N = NB/2, and r=NB); the time for the triangular matrix update may be assumed to be the same as this;
      • the row swaps, considering only those outside the lower panel: their time can be determined from the amount of data that must pass through the vertical links in a column of the network, divided by the rate (MB/s) that a link is capable of; the rate can be taken as the rate to scatter/gather the matrix from UE 0 when you run matmultest with -h.

    All other parts of the computation can be ignored.

    Include in your report the commands used to get these values. A summary formula for the estimate follows.
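
    Put together, the estimate described above amounts to (the symbols are ours, following the bullet list above):

      R ≈ F_iter / (T_mult + NB * T_piv + T_r1 + T_tri + T_swap)

    where F_iter is the total flop count for the iteration, T_piv is the time of one absolute-maximum search (hence the factor NB), T_r1 is the time for all NB rank-1 updates (the suggested matmultest run with r=NB measures this directly), T_tri = T_r1 by the stated assumption, and T_swap is the bytes crossing a column's vertical links divided by the measured link rate.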

  10. The block distribution is sub-optimal in terms of performance for LINPACK (and indeed for a whole class of important block-partitioned dense matrix computations). Explain why this is (hint: consider the diagram for the distributed version of LINPACK). Recalling that we effectively used a block distribution for LINPACK on OpenMP (cf. Assignment 1, Q7) and GPUs, explain why this was not a problem in those situations. From searching the literature (or otherwise), find out what distributions are used for distributed memory versions of LINPACK and similar computations, and explain why they are better.

  11. Based on your experiences in this course, comment on the programmability (relative ease of designing, implementing and debugging programs) of the three paradigms you have used so far: OpenMP (shared memory), CUDA (streaming device programming) and RCCE (distributed memory). For the latter, take into account the fact that you had less time to learn it, balanced against the fact that your tasks were more limited, and that you were using a simpler but sub-optimal distribution. Critically evaluate the claims made in Kumar et al, The Case For Message Passing On Many-Core Chips.

Your report may be generated in your favorite format (this may include plain text!) but must be converted to standard PDF format before submission. The RCCE file(s) should contain your final (best) versions of your code, although interesting intermediate versions of code sections could be left there commented out (preferably with #ifdefs) and/or placed in your report.

Marking Guidelines

Your code must preserve correctness to obtain any marks for the associated task. Code tidiness, clarity, elegance and parallelization will be expected. Most marks will be allocated directly to the report. Quality of writing (grammar, sentence structure, organization) and clarity of explanations will be expected. Presentation of data is important. Don't include data/program output that is not relevant to the question being answered; don't present numbers with a meaningless number of digits.

In terms of content, the bottom line is to show good and interesting insights into performance issues for a numerically-intensive computation on this highly multicore co-processor. Few or no marks will be given for pasting in voluminous, uninterpreted program output.

In your report, you need not go to a lot of trouble with typesetting tables and figures. In fact, plain text format will be quite adequate. However, your report should be tidy and presentable (including free from spelling and grammatical errors).

Hints

Since similar implementations have already been provided (e.g. in rank1update.c from Laboratory 8), the main issues will be in ensuring you have specified the parameters to the corresponding RCCE functions properly. If the program deadlocks (does not terminate after a reasonable time), the error may be due to specifying the wrong process co-ordinates. The -p option may be useful for small matrices.

Last modified: 25/11/2011, 13:48
