COMP8320
Laboratory 08 - week 10, 2011
Introduction to the RCCE Emulation Environment
and the Intel SCC
This session will provide an introduction to distributed memory programming
via the RCCE in emulation mode. We will use the CS student system.
Setting Up
Copy the files for the session into your home account area:
cp -r /dept/dcs/comp8320/public/lab08/ .
Go to your lab08 directory and type make.
This will compile the dsum and rank1utest programs.
Try running the first program with 4 processes:
rccerun_h -nue 4 ./dsum 10000
Note that the RCCE manual pages are available online on the CS system,
e.g. man RCCE_send. The names of the available RCCE functions are in
/dept/dcs/comp8320/include/RCCE.h.
The RCCE Specification document is available from
here.
Reductions
- The general usage for the dsum program is:
rccerun_h -nue p ./dsum [-v v] N
Inspect the code; you will see that it uses the RCCE reduction routine
(-v 0) and a fan-in/fan-out method (all processes send their
partial sum to process 0). How does the code complexity compare with the
version of dsum.c from Laboratory 1?
- Try running the code for both versions (-v 0 and -v 1). You can verify
that the answer produced is correct by comparing it with the result for p=1.
Observe the timings: there may not be much difference in emulation mode, but
which would you expect to be faster on the actual SCC?
- Implement the tree-based reduction method, as discussed in the tutorial,
for -v 2. Test your code for various p.
How would you expect its performance to compare on the real SCC?
The pseudocode for the algorithm, as executed by process id out of p
processes, is:
for (d=1; d < p; d*=2) {
  if (id % (2*d) == d)                  // send partial sum
    send(..., id - d);
  if (id % (2*d) == 0 && id + d < p) {
    recv(..., id + d);
    // add received value to partial sum
  }
}
This leaves the total sum in process 0. The code to broadcast it back is
analogous:
for (d=p/2; d > 0; d/=2) {
  if (id % (2*d) == 0 && id + d < p)
    send(..., id + d);                  // global sum
  if (id % (2*d) == d)
    recv(..., id - d);                  // global sum
}
If p is not a power of 2, d should instead be initialized to the
largest power of 2 that is less than p.
Rank-1 Update
- The general usage for the rank1utest program is:
rccerun_h -nue p ./rank1utest [-p] [-v v] [-x px] [-r r] M N
This does a rank-1 update computation on a matrix distributed across a
py by px process grid, where py = p / px.
Note: RCCE in emulation mode is implemented via OpenMP threads.
Therefore, programs must be thread-safe. This in turn means that the use of
global variables should be avoided, and hence getopt() is unsafe.
As a result, this program only accepts options in the above
order, unlike previous test programs.
- Inspect the code. The program initializes data on process (0,0),
which scatters it out to the others.
For the rank-1 update, the program must broadcast the distributed
vector x across the process rows and y down the process columns; each
process then performs a local rank-1 update on its portion of the matrix.
Finally, the distributed matrix is gathered back to process (0,0).
- The program works for px = p (the default case). Add code to broadcast
y down the process columns so that it works for any process
grid. The code is analogous to the code used to broadcast x;
note the commented-out print statements, which may be useful for debugging.
When you (finally) get access to the SCC!
Compare the performance of these different versions (use a reasonably large
matrix size). Are the results what you expected? For the rank-1 update
computation, use -r 100 and compare. What do you notice?