CECS Home | ANU Home | Search ANU
The Australian National University
ANU College of Engineering and Computer Science (CECS)
Department of Computer Science
Printer Friendly Version of this Document
High Performance Scientific Computing COMP4300
COMP4300: Laboratory 1

COMP4300 2009: Laboratory 1

Introduction to MPI on the Saratoga Cluster

The aim of this lab is get up and running on the Saratoga Cluster and give you an introduction to MPI.

Saratoga Login IDs

First Name       Surname          Login ID
Simon            Bragg            c43sxb
Markus           Brenner          c43mxb
Steven           Chang            c43sxc
Michael          Chapman          c43mxc
Sumedha          De Silva         c43sxd
Sotirios         Diamand          c43syd
Zi               Dong             c43zxd
Artyom           Dziouba          c43axd
Mathew           Ellis            c43mxe
Christopher      Fraser           c43cxf
Stephen          Gream            c43sxg
Paul             Hartlipp         c43pxh
Omar             Hashmi           c43oxh
John             Haynes           c43jxh
Jie              Hua              c43jzh
Yue              Huang            c43yxh
Jonathon         Hunklinger       c43jyh
Michael          Karas            c43mxk
Paul             Krix             c43pxk
Yuanpeng         Li               c43yxl
Xiang            Li               c43xxl
Wilson           Ly               c43wxl
Swe              Lynn             c43sxl
Francis          Markham          c43fxm
Briely           Marum            c43bxm
Timothy          Mathas           c43txm
Gaurav           Mitra            c43gxm
Ian              Munsie           c43ixm
Sheharyar        Naeem            c43sxn
Minh             Nguyen           c43mxn
Kevin            O'Shea           c43kxo
Alexander        Osborne          c43axo
Christopher      Pelling          c43cxp
Peter            Penglis          c43pxp
Joel             Plane            c43jxp
Benjamin         Polkinghorne     c43bxp
Matthew          Rankin           c43mxr
James            Richards         c43jxr
Huw              Rowlands         c43hxr
Ian              Roxburgh         c43ixr
Matthew          Scott            c43mxs
Travis           Stenborg         c43txs
Richard          Thomas           c43rxt
James            Thomson          c43jxt
Khoi-Nguyen      Tran             c43kxt
Temi             Varghese         c43txv
Naveen           Venugopal        c43nxv
Andrew           Waldron          c43axw
Danny            Wang             c43dxw
Guo              Wang             c43gxw
Ming-Lun         Wen              c43mxw
Nick             Withers          c43nxw
Ji               Wong             c43jxw
Yat              Yiu              c43yxy
Li               Zhou             c43lxz
YOU WILL NEED TO ATTEND THE LAB AND ASK ME FOR THE PASSWORD. Log on to the machine via
   ssh saratoga -l c43xxx
Saratoga is a resource within the Computer Systems research group, so it is run an administered by the group (not the Technical Support Group). As a consequence, please use the machine with respect. If you have problems you will need to see myself, Rui Yang or Ganesh Venkateshwara (both in room N232).

Be aware - THERE ARE NO BACKUPS ON SARATOGA - it is your responsibility to periodically move your files back to the student system. File space is also tight. You have a quota of 200MB, but there is insufficient space on the disk for everyone to use this. So please clean up periodically.

Saratoga is the front end in a Beowulf cluster, providing a bridge between the outside world and the actual cluster. Saratoga is a 2GHz single CPU opteron processor (cat /proc/cpuinfo). The actual cluster has 7 working nodes (1 is not working!) that are (imaginatively) named node00, node02:node07 (guess which one is dead!). Each of these nodes is a 2.2GHz dual core Athlon, so in total you can run using 14 cores over 7 nodes.

While you can log on to the nodes of the cluster (do "ssh node00" from saratoga), you will not normally do this. Rather you will submit jobs to the cluster using a queuing system. We will use Sun N1 Grid Engine (more on this later).

On Saratoga you should be able to access emacs, kate and vi/vim to edit your files.

Example Programs

A tar file containing all the programs for this lab is available on Saratoga in /tmp/COMP4300. Obtain this and untar it as follows:
  cp /tmp/COMP4300/lab1.tar .
  tar -xvf lab1.tar

mpiexample1.c


This program is just to get started. It looks like:
#include  
#include "mpi.h" 
 
int main( argc, argv ) 
int  argc; 
char **argv; 
{ 
    int rank, size; 
    MPI_Init( &argc, &argv ); 
    MPI_Comm_size( MPI_COMM_WORLD, &size ); 
    MPI_Comm_rank( MPI_COMM_WORLD, &rank ); 
    printf( "Hello world from process %d of %d\n", rank, size ); 
    MPI_Finalize(); 
    return 0; 
} 

Note there are 3 basic requirements for ALL MPI codes
    #include "mpi.h"
    MPI_Init( &argc, &argv ); 
    MPI_Finalize(); 
You can find the header file in /opt/cluster/mpich_127/include/mpi.h. Take a look at it. It provides the definition of MPI_COMM_WORLD - what integer value does this take?

MPI_Init and MPI_Finalize should be the first and last executable statements in your code .... basically because it is not clear what happens before or after calls to these functions!! "man MPI_Init" says:

The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.


If you want to know what an MPI function does you can: Noting that at the moment we are only interested in MPI1.
Compile the code
  make mpiexample1
This will result in
/opt/cluster/mpich_127/bin/mpicc  -c mpiexample1.c    
/opt/cluster/mpich_127/bin/mpicc -o mpiexample1 mpiexample1.o 
mpicc is a wrapper that will end up calling a standard C compiler (in this case gcc). (Do mpicc -v mpiexample1.c to see all the details!). mpicc also ensures that the program links with the mpi library.

Run the code interactively by typing

  
   mpiexample1
You should find the executable runs but using just one process. With some MPI implementations the code will fail because you have not defined the number of processes to be used. Using MPICH this is done using the command mpirun.

Try running the code interactively again but this time by typing

  mpirun -np 2 mpiexample1
Now try
  
  mpirun -np 6 mpiexample1
(Don't set -np to anything over 10).

If you run this program enough times you may see that the order in which the output appears changes. Output to stdout is line buffered, but beyond that can appear in any order.

mpirun has a host of different options. Do "man mpirun" for information. The "-np" refers to the number of processes that you wish to spawn.

Now we will run the same job, but using the batch queuing system. To submit a job to the queuing system we have to write batch script. An example of this is given in file batch_job. Take a look at this. Lines starting with "#$" are commands to the queuing system, informing it of how much resources you require and how your job should be executed. We use one of these lines to set the number of processors you want to use. After all this setup information you run the job by issuing the mpirun command, but taking the number of processes from the number of processors allocated by the queuing system.

To submit your job to the queuing system do

qsub batch_job
it will respond with something like
qsub batch_job
Your job 84 ("mpich_job") has been submitted
where 84 is the id of the job in the queuing system. To see what is happening on the batch queue do
c43tut@saratoga:~/lab1> qstat
job-ID  prior   name       user state submit/start at queue slots ja-task-ID 
-------------------------------------------------------------------------------
   86 0.55500 mpich_job  c43tut r     03/07/2009 10:46:45  all.q@node04  8
   87 0.55500 mpich_job  c43tut qw    03/07/2009 10:46:37   
this shows two jobs, one running (status r) and one waiting. The running job is using 8 slots.

To delete a job from the queue, do

qdel 86
c43tut has deleted job 86

The queuing system has a total of 14 executing slots (this is determined at setup, and is set to 14 because we have 7 nodes each with a dual core processor). If you submit a job that uses all 14 slots, no other user will be able to run until your job finishes. In this respect the system is said to be space sharing, rather than time sharing.

Make sure you are happy with all of the above.


Exercise 1


Modify the code in mpiexample1 to also printout the name of the node each process is executing on. Do this by using the system call
  gethostname(name, sizeof(name));
  1. Run your modified version of mpiexample1 interactively. What nodes of the cluster are being used?
  2. Repeat the above, but now use the batch file. What nodes are now being used? If there is a difference, how is this difference set?
  3. Noting that each node in our cluster can support two processes, how does the queuing system map processes to nodes? For example if we run a 4 process MPI job does it only use two or more physical nodes? And how are the MPI process IDs (ranks) mapped to physical nodes.

Exercise 2


Throughout the course we will be measuring the elapsed time taken to run our parallel jobs. So we start by assessing how good our various timing routines are.
  1. What is the difference between timer overhead and timer resolution?
  2. We can assess the overhead and resolution of a timer by calling it twice in quick succession, printing the difference, and repeating this whole process many times. Why is this? (This was not covered in lectures, but was in COMP3320 if you took that course. If you don't know ask me, ...or someone who attended COMP3320.)
  3. Code that does the above for the gettimeofday system call is provided in walltime.c. Compile and run this, and from the output estimate the overhead and latency of gettimeofday. (If you are not familiar with gettimeofday, do man gettimeofday.)
  4. Is the resolution of the timer on the remote nodes the same as that on Saratoga?
  5. MPI provides its own timing routine, MPI_Wtime. (Do man MPI_Wtime.) Insert extra code to test the resolution of this routine. What do you estimate the resolution to be?
  6. What does the function MPI_Wtick do? What value does it report?
Compile the code via "make walltime". You can run the code interactively as it only takes a short time.

Exercise 3

In code mpiexample2.c each process allocates an integer buffer of size len (=1024integers). Each buffer is initialized to the rank of the process. Process 0 sends its buffer to process 1 and vice versa, i.e. process 0 sends a message of zeros and receives a message of 1s, while process 1 does the opposite.
  1. The code runs correctly with len=1024 but never completes if you change this to len=50*1024*1024. (Try this but don't let it run for too long!). Explain why this is happens. How you would fix the code so it works correctly in both cases?

Exercise 4

mpiexample3.c is a basic pingpong code. Run the code and make sure it works.
  1. Currently the code only does pingpong between 0 and 1 for a message containing a single integer and measures the time using MPI_Wtime. Modify the code so that it runs for len=1 to a maximum message size of 4*1024*1024 integers for messages of size 4n (i.e. 1, 4, 16, 64, 256 etc). Have the code print out the absolute time and the bandwidth. Which timer did you use, MPI_Wtime or gettimeofday? Is the resolution sufficient?
  2. What latency do you measure on the cluster? What is the peak bandwidth? How does the bandwidth change with message length?
  3. Further modify the code so that it measures the pingpong time between process 0 and all other processes in MPI_COMM_WORLD for messages of 1, 1024 and 1048576 integers.
  4. Run your code on the batch system to complete the following table
      -----------------------------------------------------------
        Message   ----time for pingpong between two processes----
      Size(ints)      within_a_node      between_two_nodes
      ------------------------------------------------------------
               1
            1024 
         1048576 
      --------------------------------------------------------
    
  5. What results did you expect to see? Are the results in line with these expectations? If not why not?

Exercise 5


This is a tricky question! mpiexample4.c uses a binary tree to perform a basic broadcast.
  1. Change the code to perform a global sum of the vector mbuf over all processors. (Note - you will need to initialize the content of mbuf on all processes - not just rank 0 as is the case now. It is best to initiate the values on each process to be different e.g. set values to be rank. Also it might be wise to actually have len greater than 1.) You should use a binary tree algorithm. The result should only be present on process 0.
  2. Modify the above code so that the id of the "root process" is passed as an argument to the reduction routine. (The root process is the one that receives the final result). Verify that your code works correctly even when the root process is not process 0. (What you have written is now roughly eqivalent to MPI_Reduce - except that your routine is only performing a summation and restricted to integers.)
  3. Now modify your reduction code so that the result is present on all processes (there should no longer be a root process). Your new routine should now be roughly equivalent to MPI_AllReduce.