ANU College of Engineering and Computer Science
School of Computer Science
COMP4300/6430 2011: Laboratory 6

UNDER DEVELOPMENT

GPU Programming with CUDA

The aim of this lab is to introduce you to programming GPUs using CUDA. The material is taken from an NVIDIA tutorial.

A reminder that comprehensive documentation for the NCI Xe system is available HERE. A tar file containing all the programs for this lab is available HERE. Save this tar file on your local desktop and then transfer it to the Xe system as in the last lab.

Using the GPUs on the Xe

Start by reading the CUDA section of the guide to Using GPUs on the NCI National Facility Xe. This will tell you which module you need to load in order to use the NVIDIA CUDA compiler and how to access the GPUs via the batch system.

Allocating and Using Host and Device Memory

Within the tar file you will find a directory "cudaMallocAndMemcopy". In this directory there is a program that:
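For orientation, the host and device memory calls that an exercise of this kind relies on can be sketched as follows. This is a minimal illustration and not the program in the tar file: the helper names `check` and `roundTrip`, the array size, and the choice to copy host to device, device to device, and back to the host are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical helper (not from the lab): copy n ints
// host -> device -> device -> host and return the number of elements
// that did not survive the round trip (0 means success).
int roundTrip(int n) {
    size_t bytes = n * sizeof(int);
    int *h_in  = (int *)malloc(bytes);
    int *h_out = (int *)malloc(bytes);
    if (!h_in || !h_out) { fprintf(stderr, "malloc failed\n"); exit(EXIT_FAILURE); }
    for (int i = 0; i < n; ++i) { h_in[i] = i; h_out[i] = -1; }

    // Device buffers are allocated with cudaMalloc, not malloc.
    int *d_a = NULL, *d_b = NULL;
    check(cudaMalloc((void **)&d_a, bytes), "cudaMalloc d_a");
    check(cudaMalloc((void **)&d_b, bytes), "cudaMalloc d_b");

    // The last argument of cudaMemcpy states the direction of the copy.
    check(cudaMemcpy(d_a, h_in, bytes, cudaMemcpyHostToDevice), "host to device copy");
    check(cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice), "device to device copy");
    check(cudaMemcpy(h_out, d_b, bytes, cudaMemcpyDeviceToHost), "device to host copy");

    int bad = 0;
    for (int i = 0; i < n; ++i) if (h_out[i] != h_in[i]) ++bad;

    check(cudaFree(d_a), "cudaFree d_a");
    check(cudaFree(d_b), "cudaFree d_b");
    free(h_in);
    free(h_out);
    return bad;
}

int main() {
    int bad = roundTrip(1024);
    if (bad == 0) printf("Round trip OK\n");
    else printf("%d mismatches\n", bad);
    return bad == 0 ? 0 : 1;
}
```

A file like this would typically be compiled with `nvcc` once the CUDA module described in the Xe guide has been loaded.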
Running a kernel on the GPU

Now enter the directory "myFirstKernel". This contains a program template that runs a kernel on the GPU and returns the results to the host for verification. You need to:
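As a rough guide to the overall shape of such a program, the sketch below defines a kernel, launches it over a grid of thread blocks, and copies the result back for checking on the host. It is not the template from the tar file: the kernel name `fillKernel`, the helper `runFill`, the value each thread writes, and the grid dimensions are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical kernel: each thread writes a value that encodes its block
// and thread index into its own slot of the output array.
__global__ void fillKernel(int *d_a) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    d_a[idx] = 1000 * blockIdx.x + threadIdx.x;
}

// Hypothetical helper: launch the kernel, copy the result back, and
// return the number of wrong elements (0 means success).
int runFill(int numBlocks, int threadsPerBlock) {
    int n = numBlocks * threadsPerBlock;
    size_t bytes = n * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    if (!h_a) { fprintf(stderr, "malloc failed\n"); exit(EXIT_FAILURE); }

    int *d_a = NULL;
    check(cudaMalloc((void **)&d_a, bytes), "cudaMalloc");

    // Launch configuration: <<<blocks in the grid, threads per block>>>.
    fillKernel<<<numBlocks, threadsPerBlock>>>(d_a);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    check(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost), "device to host copy");

    // Verify on the host by recomputing what each thread should have written.
    int bad = 0;
    for (int b = 0; b < numBlocks; ++b)
        for (int t = 0; t < threadsPerBlock; ++t)
            if (h_a[b * threadsPerBlock + t] != 1000 * b + t) ++bad;

    check(cudaFree(d_a), "cudaFree");
    free(h_a);
    return bad;
}

int main() {
    int bad = runFill(8, 8);
    if (bad == 0) printf("Kernel result verified\n");
    else printf("%d mismatches\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Checking `cudaGetLastError()` straight after the launch catches configuration errors, which a kernel launch does not report through a return value.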
Array Reversal: Single Thread Block

Now enter the directory "reverseArray". This contains a program template for a code that, given an input array {a_0, a_1, ..., a_{n-1}} in pointer d_a, will store the reversed array {a_{n-1}, a_{n-2}, ..., a_0} in pointer d_b. Start from the "reverseArray_singleblock" template. Have it launch just one thread block to reverse an array of size N = numThreads = 256 elements. You need to implement the body of the "reverseArrayBlock()" function and have each thread move a single element to its reversed position.

Array Reversal: Multiple Thread Blocks

This time start from the "reverseArray_multiblock" template. For an array of size N, have the program create N/256 thread blocks, where each thread block has 256 threads. Note that you now have to compute both the reversed location within the block and the reversed offset to the start of the block.
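The index arithmetic for both variants can be sketched as below. This is one possible approach under stated assumptions, not the reference solution: the kernel signature, the helper `reverseCheck`, and the requirement that N be an exact multiple of the block size are assumptions, and the templates in the tar file may be organised differently. With a single block the block offset is zero, so the same kernel covers the single-block case.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Each thread moves one element. Block b of the input lands in block
// (gridDim.x - 1 - b) of the output, and thread t within a block lands
// at position (blockDim.x - 1 - t) within that block.
__global__ void reverseArrayBlock(int *d_out, const int *d_in) {
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}

// Hypothetical helper: reverse numBlocks * numThreads ints on the GPU
// and return the number of misplaced elements (0 means success).
int reverseCheck(int numBlocks, int numThreads) {
    int n = numBlocks * numThreads;
    size_t bytes = n * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    int *h_b = (int *)malloc(bytes);
    if (!h_a || !h_b) { fprintf(stderr, "malloc failed\n"); exit(EXIT_FAILURE); }
    for (int i = 0; i < n; ++i) h_a[i] = i;

    int *d_a = NULL, *d_b = NULL;
    check(cudaMalloc((void **)&d_a, bytes), "cudaMalloc d_a");
    check(cudaMalloc((void **)&d_b, bytes), "cudaMalloc d_b");
    check(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice), "host to device copy");

    reverseArrayBlock<<<numBlocks, numThreads>>>(d_b, d_a);
    check(cudaGetLastError(), "kernel launch");
    check(cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost), "device to host copy");

    // Element i of the input must appear at position n - 1 - i of the output.
    int bad = 0;
    for (int i = 0; i < n; ++i) if (h_b[i] != n - 1 - i) ++bad;

    check(cudaFree(d_a), "cudaFree d_a");
    check(cudaFree(d_b), "cudaFree d_b");
    free(h_a);
    free(h_b);
    return bad;
}

int main() {
    int badSingle = reverseCheck(1, 256);     // N = 256, one block
    int badMulti  = reverseCheck(1024, 256);  // N = 256 * 1024, N/256 blocks
    printf("single block: %d mismatches, multiple blocks: %d mismatches\n",
           badSingle, badMulti);
    return (badSingle == 0 && badMulti == 0) ? 0 : 1;
}
```

Because every thread writes to a distinct output location, no synchronisation between threads is needed in this version.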
Understanding Occupancy

Compile your reverseArray_multiblock program using the -Xptxas="-v" option. This will provide information on the number of registers used per thread. Note this number. The Fermi system has 32768 registers per SM, and each SM can support up to a total of 1536 simultaneous threads. That is, you can have 1536 threads in flight, but only if their total register usage does not exceed 32768. Using the number of registers per thread and the number of threads per block, what is the maximum number of blocks that can be executing on the SM at any point in time? From this, calculate the occupancy.
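The calculation can be set out as a worked example. Here T is the number of threads per block and R is the number of registers per thread reported by the compiler. The value R = 32 below is purely hypothetical and chosen only to show a case where registers are the limiting factor; substitute the number you noted. The sketch considers only the two limits quoted above and ignores other constraints such as the per-SM cap on resident blocks, shared memory, and register allocation granularity.

$$
B_{\text{threads}} = \left\lfloor \frac{1536}{T} \right\rfloor, \qquad
B_{\text{regs}} = \left\lfloor \frac{32768}{R \cdot T} \right\rfloor, \qquad
B = \min\left(B_{\text{threads}},\, B_{\text{regs}}\right), \qquad
\text{occupancy} = \frac{B \cdot T}{1536}
$$

With T = 256 and the assumed R = 32:

$$
B_{\text{threads}} = \left\lfloor \frac{1536}{256} \right\rfloor = 6, \qquad
B_{\text{regs}} = \left\lfloor \frac{32768}{32 \cdot 256} \right\rfloor = 4, \qquad
B = 4, \qquad
\text{occupancy} = \frac{4 \cdot 256}{1536} \approx 0.67
$$

Under the same two limits, any R of 21 or fewer gives 6 resident blocks of 256 threads, since 21 x 1536 = 32256 registers fits within 32768, and the occupancy is then 1536/1536 = 1.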
Please direct all enquiries to: comp4300@cs.anu.edu.au
Page authorised by: Head of School, SoCS
The Australian National University, CRICOS Provider Number 00120C