COMP4300/6430: Laboratory 1
COMP4300/6430 2011: Laboratory 1
Introduction to the NCI Xe System and MPI
DRAFT
The aim of this lab is to get you up and running on the NCI Xe system
and to give you an introduction to MPI.
To do this lab, you need to have obtained a login ID and password by
following the instructions on the main lab web page.
The Xe is one of the systems supported by the National Computational
Infrastructure program. Staff at Australian Universities are allocated
time on this system through a competitive process for use in their
research projects. We are extremely fortunate to have been given
access to this system for this course. Please use the machine with
respect. Note that it is NOT administered by the CS Technical Support
Group.
There is comprehensive documentation for the NCI Xe system available
HERE.
You should familiarize yourself with the content. It will be
referenced in what follows.
Log on to the Xe system using your user ID (replacing apr659 with your own):
ssh xe.nci.org.au -l apr659
Each user has a file space quota. CPU time is also limited
collectively over the entire group. This means one user can exhaust
all the time of the entire group. Thus please monitor your usage of
this machine.
Read the pages of the userguide that refer to "Project Accounting", and
"Monitoring Resource Usage". On the Xe execute the following commands
and determine what they do
nf_limits
quotasu -h
quota -v
Example Programs
A tar file containing all the programs for this lab is available
HERE. Save this tar file on your local desktop and then transfer it to
the Xe system. Thus from a terminal window on your desktop and in the
directory where you have saved the lab1.tar file execute the scp
command (replacing apr659 with your login id on the xe system).
scp lab1.tar apr659@xe.nci.org.au:~
Then, in a terminal window that is logged on to the Xe, untar the file:
tar -xvf lab1.tar
Modules
Before we can compile the above programs we have to set our login
environment so that it finds the correct compilers and libraries. The
NCI systems achieve this using Linux "modules". Return to the
userguide and read the section labelled "Software Environments". Then
execute the command
module avail
You can both load and unload a module. For this lab we need to load
the "openmpi" module. Do this by executing the command
module load openmpi
mpiexample1.c
This program is just to get started. It looks like:
#include <stdio.h>
#include "mpi.h"
int main( int argc, char **argv )
{
int rank, size;
MPI_Init( &argc, &argv );
MPI_Comm_size( MPI_COMM_WORLD, &size );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
Note there are 3 basic requirements for ALL MPI codes
#include "mpi.h"
MPI_Init( &argc, &argv );
MPI_Finalize();
You can find the header file in
/apps/openmpi/1.4.3/include/mpi.h. Take a look at it.
It provides the definition of MPI_COMM_WORLD, although in a complicated
fashion: the definition involves another structure that is defined in
another routine that forms part of the library (there is no source for
this on the Xe). It used to be easier!
MPI_Init and MPI_Finalize should be the first and last executable
statements in your code, because it is not clear what happens before
or after calls to these functions. "man MPI_Init" says:
The MPI Standard does not say what a program can do before an MPI_Init
or after an MPI_Finalize. In the Open MPI implementation, it should do
as little as possible. In particular, avoid anything that changes the
external state of the program, such as opening files, reading standard
input, or writing to standard output.
If you want to know what an MPI function does you can:
- do "man MPI_function" (you need to load the openmpi
module first)
- Look at the MPI1 standard
- Look at the online MPI1 book
- Ask your tutor!
Note that at the moment we are only interested in MPI1.
Compile the code
make mpiexample1
This will result in
mpicc -c mpiexample1.c
mpicc -o mpiexample1 mpiexample1.o
mpicc is a wrapper that ends up calling a standard C compiler (in this
case gcc). Run mpicc -v mpiexample1.c to see all the details. mpicc
also ensures that the program links with the MPI library.
Run the code interactively by typing
./mpiexample1
You should find the executable runs using just one process. With
some MPI implementations the code will fail because you have not
defined the number of processes to be used. With Open MPI, the number
of processes is set using the mpirun command.
Try running the code interactively again but this time by typing
mpirun -np 2 ./mpiexample1
Now try
mpirun -np 6 ./mpiexample1
(Don't set -np to anything over 10).
If you run this program enough times you may see that the order in
which the output appears changes. Output to stdout is line buffered,
but beyond that, lines from different processes can appear in any order.
mpirun has a host of different options. Do "man mpirun"
for information. The "-np" refers to the number of processes
that you wish to spawn.
So far we have only been running our code on one of the Xe nodes. In
total the Xe has 156 nodes. One of these is reserved for interactive
logins; the remaining nodes are only available via a batch
queuing system. Go back to the userguide and read the section entitled
"PBS Batch Use".
Now we will run the same job, but using the PBS batch queuing system. To
submit a job to the queuing system we have to write a batch script.
An example is given in the file batch_job. Take a look at it. Lines
starting with "#PBS" are directives to the queuing system, telling it
what resources you require and how your job should be executed. One of
these lines sets the number of processors you want to use. After this
setup information, the script runs the job by issuing the mpirun
command, taking the number of processes from the number of processors
allocated by the queuing system.
To submit your job to the queuing system do
qsub batch_job
It will respond with something like
qsub batch_job
303239.xepbs
where 303239.xepbs is the id of the job in the queuing system. To see
what is happening on the batch queue do
c43tut@saratoga:~/lab1> qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
296307.xepbs     hals3ts.ccsdt.j  gxg501           103:38:2 R normal
---lots of jobs---
303237.xepbs     batch_job        apr659                  0 Q express
This gives a long list of jobs. In the listing above, the top job is
running, as indicated by the R in the S column, while my job is queued,
as indicated by the Q.
To delete a job from the queue, do
qdel 303237.xepbs
Make sure you are happy with the above since you will need to use the
batch system later.
Exercise 1
Modify the code in mpiexample1 to also print out the name of the
node each process is executing on. Do this by using the system call
(declared in unistd.h)
gethostname(name, sizeof(name));
- Run your modified version of mpiexample1 interactively. What
nodes of the cluster are being used?
- Repeat the above, but now use the batch file. What nodes are now
being used?
- Modify the batch script so that your MPI code runs on at least
two different nodes of the Xe cluster.
Exercise 2
Throughout the course we will be measuring the elapsed time taken to
run our parallel jobs. So we start by assessing how good our various
timing routines are.
- What is the difference between timer overhead and timer resolution?
- We can assess the overhead and resolution of a timer by calling
it twice in quick succession, printing the difference, and repeating
this whole process many times. Why is this? (See lecture 5)
- Code that does the above for the gettimeofday system
call is provided in walltime.c. Compile and run this,
and from the output estimate the overhead and resolution of
gettimeofday. (If you are not familiar with
gettimeofday, do man gettimeofday.)
- MPI provides its own timing routine, MPI_Wtime. (Do
man MPI_Wtime.) Insert
extra code to test the resolution of this routine. What do you
estimate the resolution to be?
- What does the function MPI_Wtick do? What value does it
report?
Exercise 3
In the code mpiexample2.c each process allocates an integer buffer of
size len (= 128 integers). Each buffer is initialized to the rank of
the process. Process 0 sends its buffer to process 1 and vice versa,
i.e. process 0 sends a message of 0s and receives a message of 1s,
while process 1 does the opposite.
- Compile and run the code interactively using two
processes. Verify that it works as you expect.
- Now change the code so that len=1024. Attempt to run the
code. You should find that it fails to complete. Why? Fix the code so
that it completes for any value of len.
Exercise 4
mpiexample3.c is a basic pingpong code. Run the code and make
sure it works.
- Currently the code only does pingpong between processes 0 and 1 for
a message containing 64 integers and measures the time
using MPI_Wtime. Modify the code so that it
runs for messages of size 4^n integers
(i.e. 1, 4, 16, 64, 256, 1024 etc.), from len=1 up to a maximum
message size of 4*1024*1024 integers. Have the code print out the
absolute time and the bandwidth. Are the results what you
expected?
- What latency did you measure and what
peak bandwidth? How does the bandwidth change with message
length?
- Further modify the code so that it measures the pingpong
time between process 0 and all other processes in
MPI_COMM_WORLD for messages of 1, 1024 and
1048576 integers.
- Run your code on the batch system using 16 CPUs and complete the
following table
------------------------------------------------------------
Message       ----time for pingpong between two processes----
Size (ints)   within_a_node          between_two_nodes
------------------------------------------------------------
1
1024
1048576
------------------------------------------------------------
- What results did you expect to see? Are the results in line
with these expectations? If not, why not?