ANU College of Engineering and Computer Science
School of Computer Science
COMP8320 Multicore Computing Assignment 2, 2011
CUDA LINPACK Benchmark on The Fermi GPU
Deadline: 17:00 on
| How Late | Penalty from 60 Marks |
|---|---|
| less than 1 hour | -1 |
| between 1 and 6 hours | -2 |
| between 6 and 24 hours (1 day) | -4 |
| between 24 and 48 hours (2 days) | -8 |
| between 48 and 72 hours (3 days) | -16 |
| between 72 and 96 hours (4 days) | -32 |
| more than 96 hours (4 days) | -64 (forget it!) |
The provided test program linpack.cu is essentially a CUDA port of the program you used for Assignment 1 (please refresh your memory of the structure of the blocked LU decomposition algorithm there before proceeding). Two test programs are built using the command make: linpack and matmultest. The latter is a generalized version of the program used in Lab 07.
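As a refresher, the blocked, right-looking LU factorization with partial pivoting follows roughly the structure below. This is an illustrative outline only: the function names are hypothetical stand-ins, not the actual routines in linpack.cu.

```cuda
// Illustrative outline of blocked LU factorization with partial pivoting.
// A is N x N, column-major; NB is the blocking factor; piv records swaps.
// factorPanel(), applySwaps(), updateUpperPanel() and
// updateTrailingMatrix() are hypothetical helper names.
void blockedLU(double *A, int *piv, int N, int NB) {
    for (int j = 0; j < N; j += NB) {
        int nb = (NB < N - j) ? NB : N - j;
        // 1. Factor the current panel A[j:N, j:j+nb], recording the
        //    row swaps in piv[j..j+nb-1].
        factorPanel(A, piv, N, j, nb);
        // 2. Apply those row swaps to the columns left and right of
        //    the panel.
        applySwaps(A, piv, N, j, nb);
        // 3. Triangular solve (DTRSM) to form the upper panel:
        //    A[j:j+nb, j+nb:N] <- L11^{-1} * A[j:j+nb, j+nb:N].
        updateUpperPanel(A, N, j, nb);
        // 4. Rank-nb update (DGEMM) of the trailing sub-matrix:
        //    A[j+nb:N, j+nb:N] -= A[j+nb:N, j:j+nb] * A[j:j+nb, j+nb:N].
        updateTrailingMatrix(A, N, j, nb);
    }
}
```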
The general usage for the Linpack program is:
The -v option is used to select the various versions of the benchmark. All of these can be tuned by the blocking factor NB.
-v 0 will select a computation using host (x86-64) BLAS (via the ATLAS library); this can be useful for calculating `speedups' over the GPU versions. -v 5 is also run on the host, using unoptimized C code; that algorithm was used to develop the `serial' CUDA kernels in linauxkernels.cu - inspect the code to see how this was done. Note that you can test these without having to run linpack under the runXe command, as the resulting code paths do not access a GPU.
Other values of v select GPU versions, kept in linsolve.cu. -v 1 calls dgetrfMM(), where only the matrix multiply part of the factorization is done on the GPU. It copies the trailing sub-matrix over to the device, instructs the GPU to perform the multiply, and then copies the trailing sub-matrix back to the host. Obviously, data transfer time will be a limiting factor for this algorithm!
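The data movement in this version amounts to a copy-compute-copy pattern per outer iteration. The sketch below is a hedged illustration using the legacy CUBLAS interface: the variable names and dimensions are assumptions, not the actual code in linsolve.cu.

```cuda
// Illustrative copy-multiply-copy pattern for offloading only the DGEMM.
// hA (m x k), hB (k x n), hC (m x n) are host arrays, column-major;
// all names and sizes here are made up for illustration.
double *dA, *dB, *dC;
size_t szA = (size_t) m * k * sizeof(double);
size_t szB = (size_t) k * n * sizeof(double);
size_t szC = (size_t) m * n * sizeof(double);
cudaMalloc((void **) &dA, szA);
cudaMalloc((void **) &dB, szB);
cudaMalloc((void **) &dC, szC);

// Host -> device: panels and trailing sub-matrix.
cudaMemcpy(dA, hA, szA, cudaMemcpyHostToDevice);
cudaMemcpy(dB, hB, szB, cudaMemcpyHostToDevice);
cudaMemcpy(dC, hC, szC, cudaMemcpyHostToDevice);

// Trailing update C <- C - A*B on the GPU (legacy CUBLAS call).
cublasDgemm('N', 'N', m, n, k, -1.0, dA, m, dB, k, 1.0, dC, m);

// Device -> host: only the updated trailing sub-matrix comes back.
cudaMemcpy(hC, dC, szC, cudaMemcpyDeviceToHost);
```

Counting the bytes moved against the 2mnk flops of the multiply makes it clear why transfer time dominates for this version.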
-v 2 calls dgetrfUMM(), where the update of the upper panel and the matrix multiply are performed on the GPU. It tries to be a bit smarter with respect to data transfer. The whole matrix is copied at the start. At each outer iteration, the lower panel is copied back from the GPU, and is factored on the host. It is then copied back to the GPU. Row swaps to the left are performed on the host, whereas those on the right are performed on the GPU. After the upper panel is completed on the GPU (if the compile switch USE_HOST_TRSM is defined, it will do it on the host instead), it is copied back to the host. Finally the matrix multiply on the trailing sub-matrix is performed on the GPU.
-v 3 calls dgetrfAll(), which performs all of the computation on the GPU: the matrix is copied to the GPU at the start, and is copied back at the end.
-v 4 calls dgetrfWild(), an optional routine that is yet to be implemented.
By default, these routines call the CUDA BLAS to perform all jobs on the GPU. If the -k option is used, they will instead call CUDA kernels in linkernels.cu on the GPU. Currently, the kernels implemented are the matrix multiply algorithms mentioned in Lecture 7 and serial (single thread) kernels for the other parts of Linpack. It is your task to design and write more efficient kernels in linkernels.cu and add in the code to invoke them!
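For orientation, a kernel replacing a BLAS call has this general shape: a __global__ function indexed by thread and block coordinates, plus a host-side launcher that sizes the grid. The sketch below is generic (in the spirit of the Lecture 7 naive algorithm), not the code in linkernels.cu; BLK and all names are illustrative.

```cuda
// Naive matrix-multiply update kernel and launcher, column-major storage:
// C (M x N) <- C - A (M x K) * B (K x N). Illustrative sketch only.
#define BLK 16

__global__ void matMultNaive(int M, int N, int K,
                             const double *A, const double *B, double *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {          // guard against ragged edges
        double sum = 0.0;
        for (int l = 0; l < K; l++)
            sum += A[l * M + row] * B[col * K + l];
        C[col * M + row] -= sum;
    }
}

void launchMatMultNaive(int M, int N, int K,
                        const double *A, const double *B, double *C) {
    dim3 threads(BLK, BLK);
    // Round the grid up so every element is covered.
    dim3 grid((N + BLK - 1) / BLK, (M + BLK - 1) / BLK);
    matMultNaive<<<grid, threads>>>(M, N, K, A, B, C);
}
```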
The general usage for the matrix multiply test program is:
-v 0 selects the parallel but otherwise unoptimized matMult_0() kernel in linauxkernels.cu. -v 1 can be used to select the matMult_1() kernel with the shared memory optimization. -v 2 is similar, but it uses the code in launchMatMult1K() to launch the kernel. -v 3 can be used to test your version of this routine launchMatMultK() (together with your improved matrix multiply kernel) in linkernels.cu. -v 4 can be used to test your rank-1 update kernel and launch code, through the routine launchRank1UpdateK().
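The rank-1 update tested by -v 4 is the innermost operation of the unblocked panel factorization: subtracting the outer product of the scaled pivot column and the pivot row from the remaining sub-matrix. A minimal sketch of such a kernel, with illustrative names (not the required signature of launchRank1UpdateK()):

```cuda
// Hypothetical rank-1 update kernel: A[0:M, 0:N] -= x * y^T, where x is
// the scaled pivot column (length M) and y the pivot row (length N).
// Column-major storage with leading dimension lda; names illustrative.
__global__ void rank1Update(int M, int N, int lda,
                            const double *x, const double *y, double *A) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    if (i < M && j < N)
        A[(size_t) j * lda + i] -= x[i] * y[j];
}
```

Since each thread touches one element and reuses x[i] and y[j] across a block, caching these vectors in shared memory is a natural optimization to try.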
You should study the general structure and organization of these files, and in particular linauxkernels.cu, which has examples of GPU kernel definitions and launching. These examples should demonstrate enough CUDA features for your needs for this assignment - if not, refer to the external CUDA documentation.
Note: it is desirable that your code can handle the case where the matrix size is not a multiple of thread block sizes. However, if this proves to be a problem, it can be avoided by using another kernel to `cleanup' the trailing thread blocks: see launchMatMult1K() for example code.
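One way to realize the cleanup strategy is to run an unguarded kernel over the region covered by full thread blocks, then a guarded kernel over the leftover columns. The sketch below assumes, for simplicity, that M is a multiple of the block size and that matrices are column-major; matMultFast and matMultEdge are hypothetical kernel names, not code from the assignment files.

```cuda
// Sketch of the `cleanup' strategy (cf. launchMatMult1K()): full tiles
// go through a fast, unguarded kernel; trailing columns go through a
// guarded edge kernel. Assumes M % BLK == 0 and column-major storage.
void launchWithCleanup(int M, int N, int K,
                       const double *A, const double *B, double *C) {
    dim3 threads(BLK, BLK);
    int Nf = (N / BLK) * BLK;     // columns covered by full tiles
    if (Nf > 0)                   // fast path: no bounds checks needed
        matMultFast<<<dim3(Nf / BLK, M / BLK), threads>>>
            (M, Nf, K, A, B, C);
    if (N > Nf)                   // remaining N - Nf columns, guarded
        matMultEdge<<<dim3(1, M / BLK), threads>>>
            (M, N - Nf, K, A, B + (size_t) Nf * K, C + (size_t) Nf * M);
}
```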
Also, observe the number of registers used by your kernel from the compiler output. Using this, and the size of your shared data arrays, calculate the CUDA occupancy and compare it with your results from the matmultest program. Experimentally, a naive measure of occupancy can be obtained by dividing the time of the single-thread calculation M = N = 1 (minus the overhead of launching a kernel, see Q2) by the time for M = N = K divided by N² (use a largish K here, say 1024). Why does this give an inflated value for occupancy?
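As a worked illustration of the static calculation (the per-kernel figures below are made-up examples, not measurements; the per-SM limits are those of a Fermi-class, compute capability 2.0 device):

```cuda
// Illustrative occupancy arithmetic for one Fermi SM (CC 2.0 limits:
// 32768 registers, 48 KB shared memory, at most 8 resident blocks and
// 1536 resident threads). Kernel figures below are hypothetical.
//
// Suppose nvcc reports 24 registers/thread and the kernel uses a
// 16 x 16 thread block with two 16 x 16 double tiles in shared memory:
//
//   threads/block  = 16 * 16                  = 256
//   regs/block     = 24 * 256                 = 6144
//   smem/block     = 2 * 16 * 16 * 8 bytes    = 4096 B
//
//   blocks limited by registers:  32768 / 6144 = 5
//   blocks limited by smem:       49152 / 4096 = 12
//   blocks limited by threads:    1536  / 256  = 6
//
//   resident blocks = min(5, 12, 6, 8)         = 5
//   occupancy       = 5 * 256 / 1536           ~ 0.83
```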
Note: you need to change the #if SERIAL==1 lines to #if 0 and replace the assert(TODO) calls in linkernels.cu with your kernel launch code. If you can't get a kernel working correctly, restore the line to #if SERIAL==1 and go on to the next kernel.
When completed, repeat the measurements of Q1 with -v 3 -k and comment on the effectiveness of your kernels.
Critically evaluate this idea, from the point of view of feasibility in programming, and potential for performance. Calculate (count) the number of kernel invocations in dgetrfAll() as a function of N (you may ignore terms of the order of N/NB). Using your results from Q2 on the overhead of launching a kernel, calculate the percentage of overall execution time this accounts for at N=1000 and N=4000. To what extent does that support Zed's belief?
In terms of content, the bottom line is to show good and interesting insights into performance issues for a numerically-intensive computation on this highly multicore and multithreaded co-processor. Little or no marks will be given for pasting voluminous and uninterpreted program output.
In your report, you need not go to a lot of trouble with typesetting tables and figures. In fact, plain text format will be quite adequate. However, your report should be tidy and presentable (including free from spelling and grammatical errors).
You can structure your kernels (__global__) to call functions that can be run on either the GPU or the host (__host__ __device__). These can be called from a normal (test) program (you will have to write it!) and hence debugged on the host.
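A minimal sketch of this structure, with illustrative names (the helper, not the kernel, is what you would exercise from your host-side test program):

```cuda
// Core logic in a __host__ __device__ helper so it can be debugged on
// the host; names are illustrative. Column-major A (M x K), B (K x N).
__host__ __device__ double dotColRow(int K, int M,
                                     const double *A, const double *B,
                                     int row, int col) {
    double sum = 0.0;
    for (int l = 0; l < K; l++)
        sum += A[l * M + row] * B[col * K + l];
    return sum;
}

__global__ void matMultK(int M, int N, int K,
                         const double *A, const double *B, double *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N)
        C[col * M + row] -= dotColRow(K, M, A, B, row, col);
}

// In an ordinary host test program, call dotColRow() directly on small
// matrices and compare against a simple reference loop - no GPU needed.
```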
For the other kernels, it is suggested that you modify them one by one. Test your program after every change; if the residual check fails, you will know that it is due to what you last did! For small N, printing out the matrix at the various stages of the computation, and comparing this to the same stage in a correct version, may be helpful.
Note that small bugs in the index of absolute maximum kernel may not necessarily cause the residual check to fail (if it is returning a random result, it probably will!): if it returns an index reasonably close to the maximum, it will cause a loss of accuracy. Hence the residual may increase but still be within the threshold. You can test correctness by comparing the values of the pivot vector (P) of say runXe linpack -v 3 -k -p 20 with that of the same command but without -k: they should be the same.
Please direct all enquiries to:
Page authorised by: Head of Department, DCS
The Australian National University — CRICOS Provider Number 00120C