The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science

COMP8320 Laboratory 08 - week 10, 2011

Introduction to the RCCE Emulation Environment and the Intel SCC


This session provides an introduction to distributed-memory programming with RCCE in emulation mode. We will use the CS student system.

Setting Up

Copy the files for the session into your home account area:
    cp -r /dept/dcs/comp8320/public/lab08/ .
Go to your lab08 directory and type make. This will compile the dsum and rank1utest programs. Try running the first program with 4 processes:
    rccerun_h -nue 4 ./dsum 10000
Note that the RCCE manual pages are on-line on the CS system, e.g. man RCCE_send. The names of the RCCE functions available are in /dept/dcs/comp8320/include/RCCE.h. The RCCE Specification document is available from here.

Reductions

  1. The general usage for the dsum program is:
      rccerun_h -nue p ./dsum [-v v] N
    Inspect the code; you will see that it uses the RCCE reduction routine (-v 0) and a fan-in/fan-out method (-v 1), in which all processes send their partial sum to process 0. How does the code complexity compare with the version of dsum.c from Laboratory 1?

  2. Try running the code for versions 0 and 1. You can verify that the answer produced is correct by comparing it with the result for p=1. Observe the timings: there may not be much difference in emulation mode, but which would you expect to be faster on the actual SCC?

  3. Implement the tree-based reduction method, as discussed in the tutorial, for -v 2. Test your code for various p. How would you expect its performance to compare on the real SCC?

    The pseudocode for the algorithm for process id of p processes is:

    for (d=1; d < p; d*=2) {
      if (id % (2*d) == d) // send partial sum
        send(..., id - d);
      if (id % (2*d) == 0  && id + d < p) {
        recv(..., id + d);
        // add received value to partial sum
      }
    }
    
    This leaves the total sum in process 0. The code to broadcast it back is analogous:
    for (d=p/2; d > 0; d/=2) {
      if (id % (2*d) == 0  && id + d < p) 
        send(..., id + d); // global sum
      if (id % (2*d) == d) 
        recv(..., id - d); // global sum 
    } 
    
    If p is not a power of 2, d should instead be initialized to the largest power of 2 that is less than p.
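
The two loops above can be checked without RCCE by simulating all p processes in one ordinary C program. The following is only a sketch under stated assumptions: the names start_distance and tree_allreduce_sim are hypothetical, and each send/recv pair is modelled as a direct array access rather than an RCCE call. It is meant for convincing yourself that the index arithmetic works for any p, not as the -v 2 implementation itself.

```c
#include <assert.h>

/* Hypothetical helper: largest power of 2 strictly less than p.
 * Returns 0 for p == 1 so that the broadcast loop does not execute. */
int start_distance(int p) {
    int d = 1;
    while (2 * d < p)
        d *= 2;
    return (p > 1) ? d : 0;
}

/* Simulate the tree reduction followed by the tree broadcast.
 * On entry val[id] is the partial sum of process id; on exit every
 * val[id] holds the global sum. Messages are modelled as array
 * accesses (an assumption of this sketch, not RCCE behaviour). */
void tree_allreduce_sim(double *val, int p) {
    assert(p >= 1);

    /* Fan-in: at distance d, process id+d sends to process id. */
    for (int d = 1; d < p; d *= 2)
        for (int id = 0; id < p; id++)
            if (id % (2 * d) == 0 && id + d < p)
                val[id] += val[id + d];   /* recv from id+d, then add */

    /* Fan-out: the global sum travels back down the same tree. */
    for (int d = start_distance(p); d > 0; d /= 2)
        for (int id = 0; id < p; id++)
            if (id % (2 * d) == 0 && id + d < p)
                val[id + d] = val[id];    /* send to id+d */
}
```

Calling tree_allreduce_sim on an array of p partial sums, for values of p that are and are not powers of 2, is a quick way to confirm that no process is skipped before porting the logic to real sends and receives.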

Rank-1 Update

  1. The general usage for the rank1utest program is:
      rccerun_h -nue p ./rank1utest [-p] [-v v] [-x px] [-r r] M N
    This does a rank-1 update computation on a matrix distributed across a py by px process grid, where py = p / px.

    Note: RCCE in emulation mode is implemented via OpenMP threads, so programs must be thread-safe. This means that global variables should be avoided, and hence getopt(), which relies on global state, is unsafe. As a result, this program only accepts options in the order shown above, unlike previous test programs.

  2. Inspect the code. The program initializes data on process (0,0), which scatters it out to the others. For the rank-1 update, the program must broadcast the distributed vector x (y) across the process rows (columns), and perform a local rank-1 update on its portion of the matrix. Finally, the distributed matrix is gathered back to process (0,0).

  3. The program works for px = p (the default case). Add code to broadcast y down the process columns so that it works for any process grid. The code is analogous to that used to broadcast x; note the commented-out print statements, which may be useful for debugging.
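
As a reminder of the two computational pieces the steps above rely on, here is a minimal sketch of a process-grid mapping and of the local rank-1 update. Everything in it is an assumption made for illustration: the names GridPos, grid_position and local_rank1_update are hypothetical, processes are assumed to be numbered row-major across the grid, px is assumed to divide p, the local matrix block is assumed to be stored row-major, and x is assumed to index matrix rows while y indexes columns. The layout actually used in rank1utest may differ, so check the provided code first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: position of a process in a py-by-px grid. */
typedef struct { int row, col; } GridPos;

/* Map a process id to grid coordinates, assuming row-major numbering:
 * ids 0..px-1 form process row 0, ids px..2*px-1 form row 1, etc. */
GridPos grid_position(int id, int p, int px) {
    assert(px > 0 && p % px == 0 && id >= 0 && id < p);
    GridPos g = { id / px, id % px };
    return g;
}

/* Local rank-1 update A += x * y^T on an m-by-n block stored
 * row-major in a. x has m entries and y has n entries; these are the
 * portions of the vectors held locally after the broadcasts. */
void local_rank1_update(double *a, const double *x, const double *y,
                        size_t m, size_t n) {
    assert(a && x && y);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] += x[i] * y[j];
}
```

With p = 6 and px = 3, for instance, this mapping places process 5 at row 1, column 2 of a 2 by 3 grid. Under these assumptions, processes with the same row value need the same piece of x and processes with the same col value need the same piece of y.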

When you (finally) get access to the SCC!

Compare the performance of these different versions (use a reasonably large matrix size). Is it what you expected? For the rank-1 update computation, use -r 100 and compare. What do you notice?