COMP8320
Laboratory 07 - week 7, 2011
Introduction to GPUs and the Xe System
This session will provide an introduction to GPU programming using CUDA,
using GPUs on the NCI National Facility Xe cluster.
Logging in and Setting Up
In this session, your demonstrator will give you an account name,
something like
xxx659, but with letters other than the
x's. This account is for you to use for the rest of the
semester. You will also receive an initial password.
Log in using the command (with the username changed as appropriate):
ssh -Y xxx659@xe.nci.org.au
Note: our course is given a shared account on
the Xe cluster called
c07. This account has a quota on its
usage (CPU hours) of the cluster.
It is extremely important that
no-one exhausts the quota, even by accident, as this would impact the
whole course! This should not happen if you use the facility in
the way described below.
It is also important that no-one mis-uses
the facility in any way - otherwise our privilege of using it may
be removed.
After logging in, you will need to customize your command line
environment for this course. To do this, simply add to your
~/.cshrc file the line:
source /short/c07/bin/Cshrc
Make sure the line is properly terminated (press the `Enter' key at
the end of the line -- otherwise it won't work!).
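If you prefer, you can append the line from the command line instead of editing the file; echo supplies the terminating newline automatically:

```shell
# Append the course environment setup line to ~/.cshrc.
# echo terminates the line with a newline, so it "just works".
echo 'source /short/c07/bin/Cshrc' >> ~/.cshrc
```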
Now, you should familiarize yourself with the NF's web pages. Go
first to the home page. Under the
Facilities, Software and
Userguides link, you will see a User Guide
describing how to use the facility. You should be aware of this,
but do not need to read it now.
Now go to the Accounts
link. At the bottom of this page, click on the
Update Contacts and Researcher Track Record Details link and enter
your details. Put your full name in the First Name field, and
enter your ANU email address and mobile phone number (the latter is the
preferred way to send new passwords). This is important as your
account may get deactivated if you don't do this.
After you have done this, you should change your password
using the password command. Be sure to remember it (if you
do forget, you will have to contact help@nf.nci.org.au).
Finally, copy the files for the session into your home account area:
cp -r /short/c07/lab07/ .
The usual text editors are available on the Xe.
Preliminaries
Go to your
lab07 directory and type
make.
This will compile the CUDA matrix multiply program for the GPU.
Try running the program. Unfortunately, there is no GPU on the
front-end node, so we have to submit the program to the cluster nodes
in batch mode:
runXe is a locally-defined interface to the PBS batch system.
It provides a safe and convenient way of doing so. Usually the Xe is
under-utilized, so small jobs (which is what we will be running!) will
finish quickly, in which case
runXe will display the program's
output and also the output of the batch submission. If not, it will tell
you which files it will place this information in and you will need to
wait for the job to finish (and the files to get created).
There is a command purgeXe which you can use to safely remove
the files generated by runXe. The command:
can be used to delete queued jobs, where
jid is the job number
which you see generated on the first line of the output of
runXe. Jobs that don't terminate are the main concern here, as
they can potentially use up our shared quota, unless they are run under
a tight time limit (which will be the case if they are run under
runXe).
Please contact the course lecturer if you want
to submit jobs directly to the PBS batch system.
Matrix Multiply
-
Inspect the CUDA source files and locate the main program, the CUDA
library calls, and the matrix multiply kernel.
The general synopsis of the program is:
runXe matmultest [-v v] [-w W] N K
Run the basic kernel matMult_0(), e.g.
runXe matmultest -v 0 1024 64
Experiment with different N (up to 2048) and K.
Also vary the parameter W (to begin with, try powers of 2 smaller
than the default of 16). Observe the speedups reported by the test
program.
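For reference, the naive kernel being exercised here typically has the following shape. This is a sketch only: the parameter names, the use of the FDATA macro, and the row-major layout are assumptions; see matmult.cu for the actual code.

```cuda
// Sketch of a naive matrix multiply kernel: C = A*B, with A of size
// N x K, B of size K x N and C of size N x N, all row-major on the device.
// Each thread computes one element of C.
__global__ void matMult_0(int N, int K, FDATA *A, FDATA *B, FDATA *C)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  /* row of C */
    int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column of C */
    if (i < N && j < N) {
        FDATA sum = 0;
        for (int k = 0; k < K; k++)
            sum += A[i*K + k] * B[k*N + j];  /* every operand read from global memory */
        C[i*N + j] = sum;
    }
}
```

Every multiply-add reads both operands from global memory, which is what the shared memory optimization below addresses.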
- Implement in the kernel matMult_1() in
matmult.cu the shared memory optimization, as discussed in
lectures. You may assume that K is
a multiple of W.
Test this using the -v 1 option. Note: the
lecture notes code declares matrix data as double; change it to
the FDATA macro (currently defined to be float).
What effect does this have?
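The lecture-notes optimization can be sketched as follows, assuming W x W thread blocks, N and K multiples of W, and W available as a compile-time constant (the real code may obtain the tile width differently):

```cuda
// Sketch of the tiled (shared memory) kernel. Each block computes a
// W x W tile of C; A and B tiles are staged through shared memory so
// each global element is read once per tile instead of W times.
#define W 16   /* tile width; must match the thread block dimensions */

__global__ void matMult_1(int N, int K, FDATA *A, FDATA *B, FDATA *C)
{
    __shared__ FDATA As[W][W], Bs[W][W];
    int ty = threadIdx.y, tx = threadIdx.x;
    int i = blockIdx.y * W + ty;   /* row of C */
    int j = blockIdx.x * W + tx;   /* column of C */
    FDATA sum = 0;
    for (int k0 = 0; k0 < K; k0 += W) {
        As[ty][tx] = A[i*K + (k0 + tx)];   /* each thread loads one element */
        Bs[ty][tx] = B[(k0 + ty)*N + j];
        __syncthreads();                   /* wait until the tile is loaded */
        for (int k = 0; k < W; k++)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();                   /* the "second barrier": no thread may
                                              overwrite the tile while others read */
    }
    C[i*N + j] = sum;
}
```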
- Implement in the kernel matMult_2() in matmult.cu
the shared memory optimization, with the second barrier removed.
This can be done by having two versions of the shared memory arrays
(one for odd k, one for even).
Does this have a noticeable effect on performance?
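The two-buffer idea can be sketched as below, under the same assumptions as before (W x W blocks, N and K multiples of W):

```cuda
// Sketch: keep two copies of each shared tile and alternate between them.
// The next iteration loads into the *other* buffer, so a fast thread can
// never overwrite data a slow thread is still reading, and the second
// __syncthreads() of matMult_1() can be removed.
__global__ void matMult_2(int N, int K, FDATA *A, FDATA *B, FDATA *C)
{
    __shared__ FDATA As[2][W][W], Bs[2][W][W];
    int ty = threadIdx.y, tx = threadIdx.x;
    int i = blockIdx.y * W + ty, j = blockIdx.x * W + tx;
    FDATA sum = 0;
    int buf = 0;
    for (int k0 = 0; k0 < K; k0 += W, buf ^= 1) {
        As[buf][ty][tx] = A[i*K + (k0 + tx)];
        Bs[buf][ty][tx] = B[(k0 + ty)*N + j];
        __syncthreads();                     /* wait for this tile to be loaded */
        for (int k = 0; k < W; k++)
            sum += As[buf][ty][k] * Bs[buf][k][tx];
        /* no barrier here: the next iteration writes the other buffer */
    }
    C[i*N + j] = sum;
}
```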
- (optional)
Implement in the kernel matMult_3() in matmult.cu
the shared memory optimization, using prefetching, i.e. you load into
the shared memory array one iteration in advance.
This again can be done by having two versions of the shared memory arrays.
Can you get away with a single barrier here as well?
What effect does this have on performance?
- (optional) You could also try other optimizations,
e.g. unrolling the k loop, having each thread operate on a 2
by 2 block instead of a single element (for this, you may assume that
N is always even). This could go in the kernel
matMult_4(). Note: if you do the 2 by 2 block, you
might need to set up the call in matmultest.cu as follows:
dim3 block(W, W);
int W2 = 2*W;
dim3 grid((N+W2-1)/W2, (N+W2-1)/W2);
matMult_4 <<<grid, block>>> (N, K, A_d, B_d, C_d);
- (optional)
You can also try double versions of the matrix multiply.
To do this, type make clean to purge old files and
change the #if 0 to an #if 1 in mat.h.
You will probably find it is noticeably slower.
- Using -v 5 invokes the CUDA BLAS matrix multiply
kernels. This indicates what kind of performance the S2050 GPU is
capable of!
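For comparison, a cuBLAS single precision matrix multiply call (the v2 API) looks roughly like the sketch below. Note that cuBLAS assumes column-major storage, so for row-major data one common trick is to compute B*A, which yields C transposed in column-major order, i.e. C in row-major order. The device pointers A_d, B_d, C_d are assumed to be set up as in matmultest.cu:

```cuda
#include <cublas_v2.h>

/* Sketch: C = alpha*A*B + beta*C via cuBLAS, for row-major A (N x K),
   B (K x N) and C (N x N). Swapping the operand order compensates for
   cuBLAS's column-major convention. */
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, K,
            &alpha, B_d, N, A_d, K, &beta, C_d, N);
cublasDestroy(handle);
```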
Further Exercises
For those wanting further exercises in CUDA programming (and have not
done COMP4300), try
COMP4300 Lab 6. Use
runXe to submit your GPU runs as
before. The codes test themselves for correctness but do not give
timing information.