
COMP8320 Laboratory 07 - week 7, 2011

Introduction to GPUs and the Xe System


This session provides an introduction to GPU programming with CUDA, using the GPUs on the NCI National Facility Xe cluster.

Logging in and Setting Up

In this session, your demonstrator will give you an account name, something like xxx659 but with other letters in place of the x's. This account is for you to use for the rest of the semester. You will also receive an initial password.

Log in using the command (with the username changed as appropriate):

    ssh -Y xxx659@xe.nci.org.au
Note: our course is given a shared account on the Xe cluster called c07. This account has a quota on its usage (CPU hours) of the cluster. It is extremely important that no-one exhausts the quota, even by accident, as this will impact the whole course! This should not happen if you use the facility in the way described below. It is also important that no-one misuses the facility in any way - otherwise our privilege of using it may be removed.

After logging in, you will need to customize your command line environment for this course. To do this, simply add to your ~/.cshrc file the line:

    source /short/c07/bin/Cshrc
Make sure the line is properly terminated (press the `Enter' key at the end of the line -- otherwise it won't work!).
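
One simple way to add a properly terminated line (echo appends the newline for you) is:

    echo 'source /short/c07/bin/Cshrc' >> ~/.cshrc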

Now, you should familiarize yourself with the NF's web pages. Go first to the home page. Under the Facilities, Software and Userguides link, you will see a User Guide describing how to use the facility. You should be aware of this, but you do not need to read it now.

Now go to the Accounts link. At the bottom of this page, click on the Update Contacts and Researcher Track Record Details link and enter your details. Put your full name in the First Name field, and enter your ANU email address and mobile phone number (the latter is the preferred way of sending out new passwords). This is important, as your account may be deactivated if you do not do this.

After you have done this, you should change your password using the password command. Be sure to remember it (if you do forget, you will have to contact help@nf.nci.org.au).

Finally, copy the files for the session into your home account area:

    cp -r /short/c07/lab07/ .
The following editors are available on the Xe:
    emacs, vi

Preliminaries

Go to your lab07 directory and type make. This will compile the CUDA matrix multiply program for the GPU. Try running the program:
    matmultest 16 4
Unfortunately, there is no GPU on the front-end, so we have to submit the program to the cluster nodes in batch mode:
    runXe matmultest 16 4
runXe is a locally-defined interface to the PBS batch system; it provides a safe and convenient way of submitting such jobs. Usually the Xe is under-utilized, so small jobs (which is all we will be running!) will finish quickly, in which case runXe will display the program's output and also the output of the batch submission. If not, it will tell you which files it will place this information in, and you will need to wait for the job to finish (and the files to be created).

There is a command purgeXe which you can use to safely remove the files generated by runXe. The command:

    qdel jid
can be used to delete queued jobs, where jid is the job number shown on the first line of the output of runXe. Jobs that do not terminate are the main concern here, as they can potentially use up our shared quota, unless they run under a tight time limit (which will be the case if they are submitted via runXe). Please contact the course lecturer if you want to submit jobs directly to the PBS batch system.
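
For reference, what runXe does on your behalf is essentially to generate and submit a small PBS job script with a tight time limit. You should not need to write one yourself; the sketch below only illustrates the idea, and the project code, queue name, walltime and resource list shown are assumptions rather than the actual runXe settings (a real script would also request a GPU resource in whatever form the Xe expects):

    #!/bin/csh
    # Hypothetical sketch of a PBS job script of the kind a runXe-style wrapper generates.
    # The project code, queue and resource limits below are assumptions.
    #PBS -P c07
    #PBS -q normal
    #PBS -l walltime=00:05:00,ncpus=1
    # run from the directory the job was submitted from
    cd $PBS_O_WORKDIR
    ./matmultest 16 4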

Matrix Multiply

  1. Inspect the CUDA source files and locate the main program, the CUDA library calls, and the matrix multiply kernel. The general synopsis of the program is:
      runXe matmultest [-v v] [-w W] N K
    Run the basic kernel matMult_0(), e.g.
      runXe matmultest -v 0 1024 64
    Experiment with different N (up to 2048) and K. Also vary the parameter W (in smaller powers of 2 than the default of 16, to begin with). Observe the speedups reported by the test program. A sketch of the general shape of such a kernel is given after this list.

  2. Implement in the kernel matMult_1() in matmult.cu the shared memory optimization, as discussed in lectures (the standard tiling pattern is sketched after this list). You may assume that K is a multiple of W. Test this using the -v 1 option. Note: the lecture notes code declares the matrix data as double; change it to the FDATA macro (currently defined to be float). What effect does this have?

  3. Implement in the kernel matMult_2() in matmult.cu the shared memory optimization, with the second barrier removed. This can be done by having two versions of the shared memory arrays (one for odd k, one for even). Does this have a noticeable effect on performance?

  4. (optional) Implement in the kernel matMult_3() in matmult.cu the shared memory optimization, using prefetching, i.e. you load into the shared memory array one iteration in advance. This again can be done by having two versions of the shared memory arrays. Can you get away with a single barrier here as well? What effect does this have on performance?

  5. (optional) You could also try other optimizations, e.g. unrolling the k loop, or having each thread operate on a 2 by 2 block instead of a single element (for this, you may assume that N is always even). This could go in the kernel matMult_4(). Note: if you do the 2 by 2 block version, you might need to set up the call in matmultest.cu as follows:
      dim3 block(W, W);
      int W2 = 2*W;
      dim3 grid((N+W2-1)/W2, (N+W2-1)/W2);
      matMult_4 <<<grid, block>>> (N, K, A_d, B_d, C_d);

  6. (optional) You can also try double-precision versions of the matrix multiply. To do this, type make clean to purge old files and change the #if 0 to an #if 1 in mat.h. You will probably find it is noticeably slower.

  7. Using -v 5 invokes the CUDA BLAS matrix multiply kernels. This indicates what kind of performance the S2050 GPU is capable of!
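
For item 1, the sketch below shows the general shape of a naive CUDA matrix multiply kernel, in which each thread computes a single element of C. The kernel name and argument order follow the call shown in item 5, but the FDATA pointer types, the row-major storage and the exact indexing are assumptions; the actual matMult_0() in the lab's matmult.cu may differ in detail.

    // Sketch of a naive kernel: one thread per element of the N x N result C,
    // with A being N x K and B being K x N, assumed stored in row-major order.
    __global__ void matMult_0(int N, int K, const FDATA *A, const FDATA *B,
                              FDATA *C) {
      int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
      int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
      if (i < N && j < N) {
        FDATA sum = 0;
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];
        C[i*N + j] = sum;
      }
    }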
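
For item 2, the standard shared-memory tiling pattern has roughly the following shape: each thread block stages a W by W tile of A and of B in shared memory, synchronizes, and accumulates partial products before moving on to the next tile. This is the generic textbook form, not the lecture-notes code; for simplicity it fixes the tile width at compile time (the lab program passes W at run time) and assumes N and K are multiples of the tile width.

    // Sketch of a shared-memory (tiled) matrix multiply kernel.
    // Assumes blockDim.x == blockDim.y == TILE_W, and that N and K are
    // multiples of TILE_W; TILE_W must match the launch configuration.
    #define TILE_W 16

    __global__ void matMult_1(int N, int K, const FDATA *A, const FDATA *B,
                              FDATA *C) {
      __shared__ FDATA As[TILE_W][TILE_W], Bs[TILE_W][TILE_W];
      int i = blockIdx.y * TILE_W + threadIdx.y;   // row of C
      int j = blockIdx.x * TILE_W + threadIdx.x;   // column of C
      FDATA sum = 0;
      for (int k0 = 0; k0 < K; k0 += TILE_W) {
        // each thread loads one element of the current tile of A and of B
        As[threadIdx.y][threadIdx.x] = A[i*K + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y)*N + j];
        __syncthreads();   // wait until the whole tile has been loaded
        for (int k = 0; k < TILE_W; k++)
          sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // the barrier that item 3 asks you to try removing
      }
      C[i*N + j] = sum;
    }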

Further Exercises

For those wanting further exercises in CUDA programming (and who have not done COMP4300), try COMP4300 Lab 6. Use runXe to submit your GPU runs as before. The codes test themselves for correctness but do not report timing information.
