The aim of this lab is to get you up and running on the NCI XE system and to give you an introduction to MPI.
To do this lab, you need to have obtained a login ID and password. These will be given to you in the lab (you will need to attend one of the two lab sessions in person).
XE and Vayu are two systems supported by the National Computational Infrastructure program. Staff at Australian universities are allocated time on these systems through a competitive process for use in their research projects. We are extremely fortunate to have been given access for this course. Please use the machine with respect. Note that it is NOT administered by the CS Technical Support Group.
TIP
There is comprehensive documentation for the NCI XE and Vayu systems available HERE. You should familiarize yourself with the content; it will be referenced in what follows.
We will start by using the XE system. Log on to the XE system using your user ID: ssh xe.nci.org.au -l <username>
As soon as you have successfully logged in, please change your password.
Then update your user details for your NCI account. Your surname is set to “Account”; do not update it. Please set your first name and mobile phone number. Your position is “Student”.
Each user has a file space quota. CPU time, however, is limited collectively over the entire group, which means a single user could exhaust the whole group's allocation. Please monitor your usage of this machine.
Read the pages of the userguide that refer to “Project Accounting”, and “Monitoring Resource Usage”. On the XE execute the following commands and determine what they do:
nf_limits
quotasu -h
quota -v
A tar file containing all the programs for this lab is available HERE. Save this tar file on your local desktop and then transfer it to the XE system. From a terminal window on your desktop, in the directory where you have saved the lab1.tar file, execute the scp command (replacing apr659 with your login ID on the XE system): scp lab1.tar apr659@xe.nci.org.au:~
Then, in a terminal window that is logged on to the XE, untar the file: tar -xvf lab1.tar
Before we can compile these programs, we have to set up our login environment so that it finds the correct compilers and libraries. The NCI systems achieve this using Linux “modules”. Return to the userguide and read the section labelled “Software Environments”. Then run the command module avail.
You can both load and unload a module. For this lab we need to load the “openmpi” module. Do this by executing the command module load openmpi.
Standard UNIX editors are installed, including nano, vim and emacs. For graphical editing, you may choose to forward X Windows to your desktop, i.e. ssh xe.nci.org.au -l <username> -X, which will allow you to run emacs as a graphical editor. Alternatively, kate will allow you to edit files over Secure FTP. For example, from your desktop, open the file sftp://<username>@xe.nci.org.au/home/444/<username>/lab1/mpiexample1.c.
mpiexample1.c
This program is just to get you started. Note that there are 3 basic requirements for all MPI codes:
#include "mpi.h"
MPI_Init( &argc, &argv );
MPI_Finalize();
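A minimal program combining these three requirements might look something like the following (a sketch only; the actual mpiexample1.c may differ):

#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    int rank, size;
    MPI_Init( &argc, &argv );                 /* must precede any other MPI call */
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );   /* which process am I? */
    MPI_Comm_size( MPI_COMM_WORLD, &size );   /* how many processes are there? */
    printf( "Hello from process %d of %d\n", rank, size );
    MPI_Finalize();                           /* last MPI call in the program */
    return 0;
}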
You can find the header file in /apps/openmpi/1.4.3/include/mpi.h. Take a look at it. It provides the definition of MPI_COMM_WORLD, albeit in a complicated fashion involving another structure that is defined elsewhere in the library (for which there is no source on the XE). It used to be easier!
MPI_Init and MPI_Finalize should be the first and last executable statements in your code, basically because it is not clear what happens before or after calls to these functions! “man MPI_Init” says:
The MPI Standard does not say what a program can do before an MPI_Init or after an MPI_Finalize. In the Open MPI implementation, it should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input, or writing to standard output.
If you want to know what an MPI function does, run man MPI_<function> (you need to load the openmpi module first). Note that at the moment we are only interested in MPI-1.
Compile the code: make mpiexample1
This will result in:
mpicc -c mpiexample1.c
mpicc -o mpiexample1 mpiexample1.o
mpicc is a wrapper that ends up calling a standard C compiler (in this case gcc). (Do mpicc -v mpiexample1.c to see all the details!) mpicc also ensures that the program links with the MPI library.
Run the code interactively by typing ./mpiexample1.
You should find the executable runs using just one process. With some MPI implementations the code will fail because you have not defined the number of processes to be used. Using openmpi this is done using the command mpirun.
Try running the code interactively again but this time by typing mpirun -np 2 ./mpiexample1.
Now try: mpirun -np 6 ./mpiexample1.
(Try using -np 10; it will fail. Why?)
If you run this program enough times you may see that the order in which the output appears changes. Output to stdout is line buffered, but beyond that the lines from different processes can appear in any order.
mpirun has a host of different options. Do “man mpirun” for information. The -np option specifies the number of processes that you wish to spawn.
So far we have only been running our code on one of the XE nodes. In total the XE has 156 nodes. One of these is reserved for interactive logins; the remaining nodes are only available via a batch queuing system. Go back to the userguide and read the section entitled “PBS Batch Use”.
Now we will run the same job, but using the PBS batch queuing system. To submit a job to the queuing system we have to write a batch script. An example of this is given in the file batch_job. Take a look at it. Lines starting with “#PBS” are commands to the queuing system, informing it of what resources you require and how your job should be executed. One of these lines sets the number of processors you want to use. After all this setup information you run the job by issuing the mpirun command, taking the number of processes from the number of processors allocated by the queuing system.
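A sketch of what such a script might contain is given below; the exact directives, and the use of $PBS_NCPUS to pick up the allocated processor count, are assumptions and may differ from the supplied batch_job, so check the file itself and the userguide.

#!/bin/bash
#PBS -q express                       # queue to submit to
#PBS -l ncpus=4                       # number of processors requested
#PBS -l walltime=00:05:00             # maximum wall-clock time
#PBS -l wd                            # start in the directory the job was submitted from
mpirun -np $PBS_NCPUS ./mpiexample1   # run on however many processors PBS allocated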
To submit your job to the queuing system run qsub batch_job.
It will respond with something like:
qsub batch_job
303239.xepbs
where 303239.xepbs is the id of the job in the queuing system. To see what is happening on the batch queue, run qstat:
apr659@xe:~/lab1> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
296307.xepbs hals3ts.ccsdt.j gxg501 103:38:2 R normal
---lots of jobs---
303237.xepbs batch_job apr659 0 Q express
This gives a long list of jobs. In the output above, the top job is running, as indicated by the R in the S column, while my job is queued, as indicated by the Q.
Now compare this with the output of nqstat.
To delete a job from the queue, run qdel 303237.xepbs.
Make sure you are happy with the above since you will need to use the batch system later.
Modify the code in mpiexample1 to also print out the name of the node each process is executing on. Do this using the system call:
gethostname(name, sizeof(name));
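In context this might look something like the following (a sketch only; the buffer name is illustrative, gethostname requires <unistd.h>, and rank is assumed to have been obtained from MPI_Comm_rank):

char name[MPI_MAX_PROCESSOR_NAME];   /* any reasonably sized buffer will do */
gethostname( name, sizeof(name) );   /* fills name with the node's hostname */
printf( "Process %d is running on %s\n", rank, name );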
Throughout the course we will be measuring the elapsed time taken to run our parallel jobs. So we start by assessing how good our various timing routines are.
1. A program that uses the gettimeofday system call is provided in walltime.c. Compile and run it, and from the output estimate the overhead and resolution of gettimeofday. (If you are not familiar with gettimeofday, do man gettimeofday.)
2. MPI provides its own timer, MPI_Wtime. (Do man MPI_Wtime.) Insert extra code to test the resolution of this routine. What do you estimate the resolution to be?
3. What does MPI_Wtick do? What value does it report?
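One plausible way to probe the timer resolution (a sketch only; you may prefer a different approach) is to call MPI_Wtime repeatedly until the value it returns changes:

double t0 = MPI_Wtime();
double t1 = MPI_Wtime();
while (t1 == t0)                      /* spin until the reported time advances */
    t1 = MPI_Wtime();
printf( "observed resolution %g s, MPI_Wtick() reports %g s\n", t1 - t0, MPI_Wtick() );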
In mpiexample2.c each process allocates an integer buffer of size len (= 128 integers). Each buffer is initialized to the rank of the process. Process 0 sends its buffer to process 1 and vice versa, i.e. process 0 sends a message of zeros and receives a message of ones, while process 1 does the opposite.

Increase len to 1024 and attempt to run the code. You should find that it fails to complete. Why? Fix the code so that it completes for any value of len.
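The exchange described above presumably follows a pattern along these lines (a sketch only; the buffer and variable names are illustrative and the actual mpiexample2.c may differ):

int other = 1 - rank;   /* the partner process, assuming exactly two processes */
MPI_Status status;
MPI_Send( sendbuf, len, MPI_INT, other, 0, MPI_COMM_WORLD );            /* send my buffer ...            */
MPI_Recv( recvbuf, len, MPI_INT, other, 0, MPI_COMM_WORLD, &status );   /* ... then receive the partner's */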
mpiexample3.c is a basic pingpong code. Run the code and make sure it works. Then modify it so that it loops from len = 1 up to a maximum message size of 4*1024*1024 integers, for messages of size 4^n (i.e. 1, 4, 16, 64, 256, 1024 etc). Have the code print out the absolute time and the bandwidth. Are the results what you expected?

Finally, record the time for a pingpong between two processes in MPI_COMM_WORLD for messages of 1, 1024 and 1048576 integers, both when the two processes are on the same node and when they are on two different nodes (you will need the batch system for the latter), and complete the following table:
| Message Size (ints) | Time for pingpong within a node | Time for pingpong between two nodes |
|---|---|---|
| 1 | | |
| 1024 | | |
| 1048576 | | |
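For the timing and bandwidth measurements above, one plausible approach (a sketch with illustrative names, shown for the process that starts the pingpong) is:

double t0 = MPI_Wtime();
MPI_Send( buf, len, MPI_INT, 1, 0, MPI_COMM_WORLD );            /* rank 0 sends to rank 1 ...  */
MPI_Recv( buf, len, MPI_INT, 1, 0, MPI_COMM_WORLD, &status );   /* ... and waits for the echo  */
double t = MPI_Wtime() - t0;                                    /* time for one round trip     */
double bandwidth = 2.0 * len * sizeof(int) / t;                 /* bytes transferred per second */

Rank 1 would do the mirror image (receive first, then send). In practice you would repeat the pingpong many times and average, since a single round trip may be shorter than the timer resolution.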