The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science

Parallel Systems

COMP4300/6430 2011: Laboratory 6

UNDER DEVELOPMENT

GPU Programming with CUDA

The aim of this lab is to introduce you to programming GPUs using CUDA. The material is taken from an NVIDIA tutorial.
This lab will again use the NCI Xe system. You should all have accounts on this system by now. If not, email comp4300@cs.anu.edu.au or, if you are in the lab, talk to the tutor.

A reminder that comprehensive documentation for the NCI Xe system is available HERE.

A tar file containing all the programs for this lab is available HERE. Save this tar file on your local desktop and then transfer it to the Xe system as per last lab.


Using the GPUs on the Xe


Start by reading the CUDA section of the guide to Using GPUs on the NCI National Facility Xe. This will tell you what module you need to load in order to use the NVIDIA CUDA compiler and how to access the GPUs via the batch system.

Allocating and Using Host and Device Memory


Within the tar file you will find a directory "cudaMallocAndMemcopy". In this directory there is a program that:
  • Allocates memory for pointers d_a and d_b on the device
  • Copies data associated with h_a on the host to d_a on the device
  • Does a device to device copy from d_a to d_b
  • Copies d_b on the device back to h_a on the host
  • Frees d_a and d_b on the device
In the template provided, the arguments to the CUDA memory allocation and copy routines are all incomplete. Fill them in so that the program compiles and executes correctly.
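The sequence of steps listed above can be sketched as follows. This is a minimal illustration and not the lab template itself: the helper name roundTrip, the array length, and the error handling are assumptions made for the example. Only cudaMalloc, cudaMemcpy and cudaFree are the actual CUDA runtime calls the exercise refers to.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the lab template): copies n ints
// host -> d_a -> d_b -> host and returns the number of mismatched
// elements, or -1 if an allocation or copy fails.
int roundTrip(int n)
{
    size_t bytes = (size_t)n * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    int *d_a = NULL, *d_b = NULL;
    int errors = -1;
    if (h_a == NULL) return errors;
    for (int i = 0; i < n; i++) h_a[i] = i;

    // cudaMalloc takes the address of the device pointer and a size in bytes.
    // cudaMemcpy takes (destination, source, size in bytes, direction).
    if (cudaMalloc((void **)&d_a, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_b, bytes) == cudaSuccess &&
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice) == cudaSuccess) {
        memset(h_a, 0, bytes);  // clear the host copy so the check is meaningful
        if (cudaMemcpy(h_a, d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            errors = 0;
            for (int i = 0; i < n; i++)
                if (h_a[i] != i) errors++;
        }
    }

    cudaFree(d_a);  // cudaFree on a NULL pointer does nothing
    cudaFree(d_b);
    free(h_a);
    return errors;
}

int main(void)
{
    int errors = roundTrip(1024);
    if (errors == 0)
        printf("Correct!\n");
    else
        printf("Failed (code %d)\n", errors);
    return errors == 0 ? 0 : 1;
}
```

Note that the size argument is always a byte count, so it is computed once and reused for every allocation and copy.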

Running a kernel on the GPU


Now enter directory "myFirstKernel". This contains a program template that runs a kernel on the GPU and returns the results to the host for verification. You need to:
  • Allocate device memory for the result of the kernel using pointer d_a
  • Configure and launch the kernel using a 1-D grid of 1-D thread blocks
  • Have each thread set an element of d_a as follows:
    idx = blockIdx.x*blockDim.x + threadIdx.x
    d_a[idx] = 1000*blockIdx.x + threadIdx.x
    
  • Copy the result in d_a back to the host pointer h_a
  • Verify that the result is correct
In the template, the kernel function's argument and body are missing, the malloc calls need arguments, the number of threads and the number of blocks used in the kernel launch are missing, and the testing code is incomplete. Complete all of these so that the code compiles and executes correctly.
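As an illustration of the launch pattern described above, here is a minimal sketch. The names fillKernel and runFill and the grid dimensions used in main are assumptions chosen for the example; the lab template may use different names and sizes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread writes one element, following the formula in the lab text.
__global__ void fillKernel(int *d_a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[idx] = 1000 * blockIdx.x + threadIdx.x;
}

// Hypothetical driver: launches the kernel on a 1-D grid of 1-D blocks,
// copies the result back and returns the number of wrong elements
// (or -1 if a CUDA call fails).
int runFill(int numBlocks, int numThreads)
{
    int n = numBlocks * numThreads;
    size_t bytes = (size_t)n * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    int *d_a = NULL;
    int errors = -1;
    if (h_a == NULL) return errors;

    if (cudaMalloc((void **)&d_a, bytes) == cudaSuccess) {
        dim3 dimGrid(numBlocks);    // 1-D grid
        dim3 dimBlock(numThreads);  // 1-D thread block
        fillKernel<<<dimGrid, dimBlock>>>(d_a);

        // Launch errors are reported by cudaGetLastError; the blocking
        // cudaMemcpy below waits for the kernel to finish.
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            errors = 0;
            for (int b = 0; b < numBlocks; b++)
                for (int t = 0; t < numThreads; t++)
                    if (h_a[b * numThreads + t] != 1000 * b + t) errors++;
        }
    }

    cudaFree(d_a);
    free(h_a);
    return errors;
}

int main(void)
{
    int errors = runFill(8, 128);  // example sizes only
    if (errors == 0)
        printf("Correct!\n");
    else
        printf("Failed (code %d)\n", errors);
    return errors == 0 ? 0 : 1;
}
```

The verification loop recomputes the expected value from the block and thread numbers on the host, so it checks both the index calculation and the stored value.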

Array Reversal: Single Thread Block


Now enter directory "reverseArray". This contains a program template for a code that, given an input array {a_0, a_1, ..., a_{n-1}} in pointer d_a, stores the reversed array {a_{n-1}, a_{n-2}, ..., a_0} in pointer d_b.

Start from the "reverseArray_singleblock" template. Have it launch just one thread block to reverse an array of size N = numThreads = 256 elements.

You need to implement the body of the "reverseArrayBlock()" function and have each thread move a single element to the reversed position.
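One possible shape for a single-block reversal is sketched below. The kernel name reverseOneBlock, its argument order, and the host driver reverseSingle are assumptions for illustration and are not taken from the template, whose "reverseArrayBlock()" signature may differ.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// With a single block, thread t reads element t and writes it to
// position (blockDim.x - 1 - t) of the output array.
__global__ void reverseOneBlock(int *d_b, const int *d_a)
{
    int in  = threadIdx.x;
    int out = blockDim.x - 1 - threadIdx.x;
    d_b[out] = d_a[in];
}

// Hypothetical driver: reverses numThreads ints with one thread block and
// returns the number of wrong elements (or -1 if a CUDA call fails).
int reverseSingle(int numThreads)
{
    size_t bytes = (size_t)numThreads * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    int *d_a = NULL, *d_b = NULL;
    int errors = -1;
    if (h_a == NULL) return errors;
    for (int i = 0; i < numThreads; i++) h_a[i] = i;

    if (cudaMalloc((void **)&d_a, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_b, bytes) == cudaSuccess &&
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        reverseOneBlock<<<1, numThreads>>>(d_b, d_a);  // exactly one block
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(h_a, d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            errors = 0;
            for (int i = 0; i < numThreads; i++)
                if (h_a[i] != numThreads - 1 - i) errors++;
        }
    }

    cudaFree(d_a);
    cudaFree(d_b);
    free(h_a);
    return errors;
}

int main(void)
{
    int errors = reverseSingle(256);  // N = numThreads = 256, as in the lab text
    if (errors == 0)
        printf("Correct!\n");
    else
        printf("Failed (code %d)\n", errors);
    return errors == 0 ? 0 : 1;
}
```

Because the input and output arrays are separate, no synchronisation between threads is needed: each thread touches exactly one input and one output element.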


Array Reversal: Multiple Thread Blocks


This time start from the "reverseArray_multiblock" template. For an array of size N, have the program create N/256 thread blocks where each thread block has 256 threads. Note that you now have to compute both the reverse location within the block and the reversed offset to the start of the block.
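The two-part index calculation mentioned above can be sketched as follows. The helper reversedIndex, the kernel name reverseMultiBlock and the driver reverseMulti are hypothetical names for this example, and the sketch assumes that N is an exact multiple of the block size.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Output position for a given (block, thread): the reversed offset to the
// start of the block plus the reversed location within the block.
__host__ __device__ inline int reversedIndex(int block, int thread,
                                             int threadsPerBlock, int numBlocks)
{
    int reversedBlockStart = threadsPerBlock * (numBlocks - 1 - block);
    int reversedThread = threadsPerBlock - 1 - thread;
    return reversedBlockStart + reversedThread;
}

__global__ void reverseMultiBlock(int *d_b, const int *d_a)
{
    int in  = blockDim.x * blockIdx.x + threadIdx.x;
    int out = reversedIndex(blockIdx.x, threadIdx.x, blockDim.x, gridDim.x);
    d_b[out] = d_a[in];
}

// Hypothetical driver: returns the number of wrong elements, or -1 if a
// CUDA call fails or n is not a multiple of threadsPerBlock.
int reverseMulti(int n, int threadsPerBlock)
{
    if (n <= 0 || threadsPerBlock <= 0 || n % threadsPerBlock != 0) return -1;
    size_t bytes = (size_t)n * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    int *d_a = NULL, *d_b = NULL;
    int errors = -1;
    if (h_a == NULL) return errors;
    for (int i = 0; i < n; i++) h_a[i] = i;

    if (cudaMalloc((void **)&d_a, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_b, bytes) == cudaSuccess &&
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        reverseMultiBlock<<<n / threadsPerBlock, threadsPerBlock>>>(d_b, d_a);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(h_a, d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            errors = 0;
            for (int i = 0; i < n; i++)
                if (h_a[i] != n - 1 - i) errors++;
        }
    }

    cudaFree(d_a);
    cudaFree(d_b);
    free(h_a);
    return errors;
}

int main(void)
{
    int errors = reverseMulti(256 * 1024, 256);  // example size only
    if (errors == 0)
        printf("Correct!\n");
    else
        printf("Failed (code %d)\n", errors);
    return errors == 0 ? 0 : 1;
}
```

Adding the two parts gives N - 1 - in, the same position a single global index calculation would produce. Keeping them separate makes the block-level and thread-level reversal explicit.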


Understanding Occupancy


Compile your reverseArray_multiblock program using the -Xptxas="-v" option. This will provide information on the number of registers used per thread. Note this number.

The Fermi GPU has 32768 registers per streaming multiprocessor (SM), and each SM can support up to a total of 1536 simultaneous threads. That is, you can have 1536 threads in flight, but only if their total register usage does not exceed 32768.

Using the number of registers per thread and the number of threads per block, what is the maximum number of blocks that can be executing on the SM at any point in time? From this, calculate the occupancy, that is, the fraction of the SM's 1536 thread slots that are in use.
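The arithmetic can be sketched in a few lines of host code. The register counts used in main (10 and 32) are made-up example values, not measurements; substitute the number reported by the compiler for your kernel. The sketch considers only the two limits quoted above and ignores other constraints such as shared memory, register allocation granularity and any cap on resident blocks per SM.

```cuda
#include <cstdio>

// Fermi per-SM limits quoted in the lab text.
const int REGS_PER_SM = 32768;
const int MAX_THREADS_PER_SM = 1536;

// Largest number of whole blocks that fit within both the register limit
// and the thread limit. Assumes both arguments are positive.
int maxBlocksPerSM(int regsPerThread, int threadsPerBlock)
{
    int byRegisters = REGS_PER_SM / (regsPerThread * threadsPerBlock);
    int byThreads = MAX_THREADS_PER_SM / threadsPerBlock;
    return byRegisters < byThreads ? byRegisters : byThreads;
}

// Occupancy as the fraction of the SM's thread slots that are in use.
double occupancy(int regsPerThread, int threadsPerBlock)
{
    int blocks = maxBlocksPerSM(regsPerThread, threadsPerBlock);
    return (double)(blocks * threadsPerBlock) / MAX_THREADS_PER_SM;
}

int main(void)
{
    const int threadsPerBlock = 256;
    const int exampleRegs[] = {10, 32};  // hypothetical values for illustration
    for (int i = 0; i < 2; i++) {
        int regs = exampleRegs[i];
        printf("%d registers/thread: %d blocks per SM, occupancy %.3f\n",
               regs, maxBlocksPerSM(regs, threadsPerBlock),
               occupancy(regs, threadsPerBlock));
    }
    return 0;
}
```

For example, at a hypothetical 32 registers per thread, one block of 256 threads needs 8192 registers, so 4 blocks fit in 32768 registers. That gives 1024 resident threads out of 1536, or an occupancy of about 0.667. At 10 registers per thread the register limit would allow 12 blocks, but the thread limit caps it at 6, giving full occupancy.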