COMP8320 Multicore Computing
Assignment 3, 2011
RCCE LINPACK Benchmark Components on the Intel SCC Cloud Computer
Deadline: 17:00 on
Friday 04 November
(Please report any errors, omissions and ambiguities
to the course lecturer)
(amendments to this document after its release will be marked in
blue).
This assignment is worth 10% of your total course mark. It will be
marked out of
40.
Submission
This assignment must be submitted electronically. This can be done
using the
submit comp8320 ass3 ... command (from the student
system).
The file names which you should use (and the
only ones that will be accepted by the submit command) are:
report.pdf
matmult.c
paridamax.c
Extensions
See
http://cs.anu.edu.au/student/comp8320/assessment.php.
Late penalties are as follows:
| How Late | Penalty from 60 Marks |
|---|---|
| less than 1 hour | -1 |
| between 1 and 6 hours | -2 |
| between 6 and 24 hours (1 day) | -4 |
| between 24 and 48 hours (2 days) | -8 |
| between 48 and 72 hours (3 days) | -16 |
| between 72 and 96 hours (4 days) | -32 |
| more than 96 hours (4 days) | -64 (forget it!) |
Note: late penalties apply even if you submit just one file after the
deadline.
Plagiarism
This is an individual assignment. Any significant assistance received
from anyone except the course staff must be acknowledged at the top of
report.pdf. See
http://cs.anu.edu.au/student/comp8320/assessment.php.
Setup
Copy the assignment files (from the CS student system)
into your home directory area:
cp -r /dept/dcs/comp8320/public/ass3 .
The provided test programs matmultest.c
and idamax.c test an RCCE distributed matrix multiply
and an index of absolute maximum computation, respectively.
They are built using the command make.
The general usage for the idamax program is:
rccerun_h -nue p idamax N
which computes the index of the absolute maximum of a vector of length
N over SCC cores (UEs) numbered 0 to
p-1. The index
and value of the maximum must be reported by all UEs. The elements of
the vector are spread over the UEs by the
block distribution;
the value of global element
i is the same regardless of
p. Thus, the parallel implementation of the algorithm can be
checked for correctness against the result for
p=1 for the
same
N.
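For example (the vector length and UE count here are illustrative only), the distributed result for eight UEs can be checked against the sequential result by comparing the output of
rccerun_h -nue 8 idamax 100000
with that of
rccerun_h -nue 1 idamax 100000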
In the block distribution, process (UE in RCCE terminology)
i, where 0 ≤ i < p, holds the following
elements of a vector of length N:
i*(N/p), i*(N/p) + 1, ..., i*(N/p) + (N/p)-1
The above is exact when
p | N; the general formula is
i0, i0 + 1, ..., i0 + n-1
where
i0 = L2G0(N, i, p) and
n = G2L(N, i, p)
(see
distlinaux.h for a definition of these functions).
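As an illustration only, the C sketch below shows one common convention for these two quantities; the helper names are hypothetical, and the handling of the remainder elements may differ from the actual definitions in distlinaux.h, which remain authoritative.

/* Hypothetical block-distribution helpers (illustration only; see distlinaux.h). */
static int blockFirstGlobal(int N, int i, int p)  /* i0: global index of UE i's first element */
{
    int b = N / p, r = N % p;
    return i * b + (i < r ? i : r);   /* assumes the first r UEs each hold one extra element */
}
static int blockLocalLength(int N, int i, int p)  /* n: number of elements held by UE i */
{
    int b = N / p, r = N % p;
    return b + (i < r ? 1 : 0);
}

When p divides N, these reduce to i0 = i*(N/p) and n = N/p, agreeing with the exact case above.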
The usage for the matmultest program is:
rccerun_h -nue p matmultest [-p] [-v v] [-x px] [-r r] [-h] M N K
This program performs
r repetitions of a distributed matrix
multiply
C += AB, where
C is an
M by
N matrix distributed using the
block distribution over a
py by
px process grid.
A is an
M by
K matrix, whose rows are distributed using the
block distribution over process column 0.
B is a
K by
N matrix, whose columns are distributed using the
block distribution over process row 0. The following diagram
illustrates this over a
4 by
4 process grid,
indicating two-dimensional process co-ordinates
(idy,idx):
-v 0 uses RCCE broadcast to broadcast A (B)
across the rows (columns) of the process grid. -v 1 uses a
pipelined broadcast and -v 2 uses a binary-tree based
broadcast instead. Refer to Laboratory Exercise
8 for descriptions and partial implementations of these algorithms.
-v 3 uses a pipelined broadcast for A, and a tree
broadcast for B. The High Performance Linpack
Benchmark, a widely-used distributed version of the LINPACK Benchmark,
typically uses this algorithm for its matrix multiply stage.
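For illustration only (this is not the Laboratory 8 code), a minimal sketch of a pipelined broadcast rooted at the first UE of a chain follows; the chunk size, the chain ordering, and the neglect of any RCCE message-size restrictions are all assumptions.

#include <stddef.h>
#include "RCCE.h"

#define CHUNK 1024   /* bytes forwarded per pipeline stage; an illustrative tuning parameter */

/* Sketch: the first UE streams 'size' bytes to UEs 1..p-1 along a chain, so
   forwarding of one chunk overlaps with the transfer of the next. */
void pipelined_bcast_sketch(char *buf, size_t size, int me, int p)
{
    size_t off;
    for (off = 0; off < size; off += CHUNK) {
        size_t len = (size - off < CHUNK) ? (size - off) : CHUNK;
        if (me > 0)
            RCCE_recv(buf + off, len, me - 1);   /* receive chunk from predecessor */
        if (me < p - 1)
            RCCE_send(buf + off, len, me + 1);   /* forward chunk to successor */
    }
}

The chain has a start-up cost proportional to p, but for large messages the chunks keep every link busy, which is why it can have a lower effective overhead than a tree broadcast on a mesh network such as the SCC's.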
The -h option generates the matrices on UE 0, scatters
the respective portions to the other UEs before the multiply, and
gathers the updated C matrix back to UE 0 for checking. The
speed of the scatter/gather operations gives useful measurements of a
single communication link in the SCC intra-chip network. However,
such a centralized model is undesirable for distributed memory algorithms;
without this option, the matrices are generated and checked in a
distributed fashion.
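For instance (the matrix sizes and grid shape here are illustrative only), a scatter/gather timing on a 2 by 4 grid might be obtained with
rccerun_h -nue 8 matmultest -h -x 4 512 512 100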
Consider now a distributed version of
LINPACK, using a two-dimensional block distribution. Assuming
the lower (upper) panels are always within a single process column (row)
-- the blocking factor w can always be adjusted to ensure this --
the implied data movement in the algorithm is:
- find P[i]: a reduction over the process column
holding Li,
combined with a broadcast down the column (and later across the rows).
- row swap within Li: two message exchanges.
- rank-1 update: broadcast ui down the process column
holding Li.
- row swaps outside the lower panel: 2w message exchanges.
- matrix multiply: broadcast Li across process rows,
and broadcast Ui down process columns.
It can be seen that the configuration in
matmultest above
mimics the matrix multiply in LINPACK. In the above algorithm, all horizontal
communication (
P[i] and
Li) occurs rightwards.
For this reason, a
pipelined broadcast can be utilized
(having an effectively lower overhead than the binary-tree based broadcast,
depending on the nature of the underlying communication network).
Accessing the SCC
Once given your password,
log in to the SCC's front-end using
ssh jolokia.
Append to the bottom of your
~/.bashrc the line:
source ~peter/comp8320/login/Bashrc
Append to the bottom of your
~/.profile the line:
Once your code is thoroughly debugged under the RCCE emulator,
scp your
ass3 sub-directory over to
jolokia (ensure that
the contents of this directory are not readable by other users).
From that directory, the
make command will copy the SCC
executable into a directory called
/shared/$USER, which is
in the mount area of the SCC cores. Only from
/shared/$USER
will you be able to run the
rccerun_h command, as before.
Note that in this directory, your test programs
will be renamed as idamax_$USER and
matmultest_$USER; rccerun_h on jolokia will
only accept programs with the suffix _$USER.
For technical difficulties with the SCC, contact comp8320@cs.
Please consider the use of the SCC a privilege, and be mindful it is a
resource shared by researchers as well as fellow students. The SCC can
be considered a `device' in the same sense as a GPU, in that only one
user at a time can (safely) use it (the provided rccerun_h
command attempts to ensure this, but it is not foolproof). Please
observe the following rules:
- use the SCC only for performance measurements, and not debugging.
- access it only via the rccerun_h command; do not
under any circumstances try to log into the cores.
- minimize the run-time of your jobs (e.g. keep under
60s - rccerun_h will enforce
this anyway!).
Finally, note that the availability of the SCC cannot be guaranteed to
the extent of other facilities that you have used so far. Therefore
it is a bad strategy to leave your work to the last moment!
Your Tasks
- Implement in paridamax.c an algorithm to find the index
(and signed value) of the absolute maximum element of a vector
distributed across a number of processes using a block distribution.
You may choose your method, but the more efficient the better. Justify
your strategy in your report. Note that while RCCE has some built-in
reductions, it is not obvious how they can be adapted to the problem.
Test the code thoroughly under the RCCE emulator.
- Implement in matmult.c the versions of distributed
matrix multiply, as specified in matmult.h (see also above
description of matmultest). Again test and debug your code
thoroughly under the RCCE emulator before proceeding.
- In your report, write (pseudo-) code for a function to perform
the row-swap operation to the left of the lower panel, e.g.
A(i, 0:j-1) <-> A(pi, 0:j-1)
Assume the function signature is as follows:
RowSwapLeft(int j, int i, int pi, double* A, int ldA)
where A is the pointer to the local storage for the matrix in
the current process. Assume that the grid and process co-ordinates
(px, py, idx and idy) are
available as global variables, as is the
matrix dimension N.
Macros etc from distlinaux.h
may be used.
Also describe how you would modify your code to perform the swaps to
the right of the panel ( A(i, j+w:n-1) <-> A(pi, j+w:n-1)).
- For a largish value of N, say 2000, determine the
scaling behavior of your absolute maximum function (use powers of
2). Tabulate your results. What patterns do you observe? Try to
explain these.
-
Determine the value of K that is `near-optimal' on a single SCC
core for a square matrix multiply (using a moderately large value of
N). Take `near optimal' as meaning the smallest K
that is within 5% of the asymptotic performance (say
K=100). Tabulate the results of your experiments. Use this
value of K (unless otherwise specified) in future
experiments.
- Effect of repeated computations.
With a 1 by 8 process grid (i.e. px=8), choose a
suitable value of N and compare the performance of the 3
versions (v=0,1,2) of square matrix multiply for both r=1
and r=10. Try to explain the difference in performance for
the 2 values of r in each case. If possible, devise further
experiments to test your conjectures.
- Effect of process grid aspect ratio. Choose a suitable
value of p that can be factorized in a number of ways, and a
suitable value of N. Explore the effect of the aspect ratio
on square matrix multiply, ranging from 1 by p,
2 by p/2, ... p by 1, for the 3
versions. You may constrain the
number of repetitions r = 2*px (this simulates the situation
in the LINPACK computation) - note that you can do this nicely in
bash as in the following example:
px=4; rccerun_h -nue p matmultest -v 2 -x $px -r $((2*px)) N N K
Tabulate your results and explain the differences; in
particular, what aspect is optimal, and why? Hint: consider
the amount of data flowing across individual network links.
- Scaling of matrix multiply. For the optimal (or near
enough) aspect ratio, determine the scalability of matrix multiply
(using the same values of N and K as in the previous
question, and choose your favorite value of
v). That is, get results from 1 core to the maximum
number for that ratio (roughly similar aspect ratio may be used for
intermediate results), and tabulate your results.
Calculate the speedups (performance in MFLOPs over
that on a single core) -- this is called strong scaling. Do
likewise for weak scaling: use the value N =
Nm * px / pxm, where
pxm is the value of px for
your largest grid, and Nm refers to the value
of N in the previous paragraph. The results should be quite
different - try to explain them.
- Estimate the FLOP rate that the LINPACK Benchmark would have on its
first `outer iteration', with the N and optimal aspect ratio
above and using NB=K. The rate can be taken as the total
number of FLOPs for the iteration (an exercise in simple maths)
divided by the time it takes for the iteration. The latter can be
broken down into the time it takes for the component
calculations. That for the multiply you have measured; that for
determining the pivots (absolute maximum) you can determine from
experiments such as those in Q4 (noting this is repeated over NB
columns); that for the rank-1 updates you can use a suitable experiment
using matmultest (suggestion: use the average value
for the width N = NB/2, and
r=NB). The time for the triangular matrix
update may be assumed to be the same as this. For the row swaps,
consider only those outside the lower panel; you can determine the
time by the amount of data that must pass through the vertical links
in a column of the network divided by the rate (MB/s) that a link is
capable of. The rate can be taken as the rate to scatter/gather the
matrix from UE 0 when you run matmultest with -h.
All other parts of the computation can be ignored.
Include in
your report the commands used to get these values.
-
The block distribution is sub-optimal in terms of performance
for LINPACK (and indeed a whole class of important block-partitioned
dense matrix computations). Explain why this is (hint:
consider the diagram for the distributed version of LINPACK).
Recalling that we effectively used a block distribution for LINPACK on
OpenMP (c.f. Assignment 1, Q7) and GPUs, explain why this was not a
problem for those situations. From searching the literature (or
otherwise), find what distributions are used for distributed memory
versions of LINPACK and similar computations, and explain why they
are better.
- Based on your experiences in this course, comment on the
programmability (relative ease of designing, implementing and
debugging programs) in the three paradigms you have used so far:
OpenMP (shared memory), CUDA (streaming device programming) and RCCE
(distributed memory). For the latter, take into account the fact that
you had less time to learn, balanced with the fact that your tasks
were more limited, and also the fact you were using a simpler but
sub-optimal distribution. Critically evaluate the claims made in
Kumar et al, The Case For Message Passing On Many-Core Chips.
Your report may be generated in your favorite format (this may include
plain text!) but must be converted to standard PDF format before
submission. The RCCE file(s) should contain your final (best)
versions of your code, although interesting intermediate versions of
code sections could be left there commented out (preferably with
#ifdefs) and/or placed in your report.
Marking Guidelines
Your code must preserve correctness to obtain any marks for the
associated task. Code tidiness, clarity, elegance and parallelization
will be expected. Most marks will be allocated directly to the report.
Quality of writing (grammar, sentence structure, organization) and
clarity of explanations will be expected. Presentation of data
is important. Don't include data/program output that is not relevant to
the question being answered; don't present numbers with a meaningless
number of digits.
In terms of content, the bottom line is to show good and interesting
insights into performance issues for a numerically-intensive
computation on this highly multicore co-processor.
Little or no marks will be given for pasting voluminous and
uninterpreted program output.
In your report, you need not go to a lot of trouble with typesetting
tables and figures. In fact, plain text format will be quite
adequate. However, your report should be tidy and presentable
(including free from spelling and grammatical errors).
Hints
Since similar implementations have already been provided (e.g. in
rank1update.c from Laboratory 8), the main issues will be in
ensuring you have specified the parameters to the corresponding RCCE
functions properly. If the program deadlocks (does not terminate after a
reasonable time), the error may be due to specifying the wrong process
co-ordinates. The
-p option may be useful for small matrices.
Last modified: 25/11/2011, 13:48