The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science


COMP8320 Multicore Computing Assignment 2, 2011

CUDA LINPACK Benchmark on The Fermi GPU


Deadline: 17:00 on Sunday 16 October (extended from Tuesday 12 October)


(Please report any errors, omissions and ambiguities to the course lecturer)
(amendments to this document after its release will be marked in blue).


This assignment is worth 15% of your total course mark. It will be marked out of 60.

Submission

This assignment must be submitted electronically. This can be done using the submit comp8320 ass2 ... command (from the student system, not xe). The file names which you should use (and the only ones that will be accepted by the submit command) are:

    report.pdf
    linkernels.cu
and optionally:
    ass2.tar

Extensions

See http://cs.anu.edu.au/student/comp8320/assessment.php. Late penalties are as follows
How Late                            Penalty (from 60 marks)
less than 1 hour                    -1
between 1 and 6 hours               -2
between 6 and 24 hours (1 day)      -4
between 24 and 48 hours (2 days)    -8
between 48 and 72 hours (3 days)    -16
between 72 and 96 hours (4 days)    -32
more than 96 hours                  -64 (forget it!)

Note: late penalties apply even if you submit just one file after the deadline.

Plagiarism

This is an individual assignment. Any significant assistance received from anyone except the course staff must be acknowledged at the top of report.pdf. See http://cs.anu.edu.au/student/comp8320/assessment.php.

Setup

Copy the files (from xe) for the assignment into your home directory area:
    cp -r /short/c07/ass2 .

The provided test program linpack.cu is basically a CUDA port of the program you used for Assignment 1 (please refresh your memory of the structure of the blocked LU decomposition algorithm there before proceeding). Two test programs are built using the command make: linpack and matmultest. The latter is a generalized version of the program used in Lab 07.

The general usage for the Linpack program is:

    runXe ./linpack [-p] [-b NB] [-v v] [-w W] [-k] [-a info] N [r [dN]]
The parameter W is passed to the CUDA versions of dgetrf() in linkernels.cu; it can be used as the CUDA (square) thread block size (default value is 16). The -a option can be used if you want to pass additional information to your code in linkernels.cu.

The -v v option is used to select various versions of the benchmark. All of these can be tuned by the blocking factor NB.

-v 0 selects a computation using host (x86-64) BLAS (via the ATLAS library); this can be useful for calculating `speedups' over the GPU versions. -v 5 also runs on the host, using unoptimized C code; this algorithm was used to develop the `serial' CUDA kernels in linauxkernels.cu - inspect the code to see how this was done. Note that you can test these without running linpack under the runXe command, as these code paths do not access a GPU.
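The serial kernels derive from straightforward C of this kind. Below is a minimal host-side sketch of unblocked LU decomposition with partial pivoting on a column-major matrix (our own illustration, not the provided code); note how each phase - pivot search, row swap, column scale, rank-1 update - corresponds to one of the serial kernels:

```c
#include <assert.h>
#include <math.h>

/* Unblocked LU factorization with partial pivoting, column-major storage,
 * leading dimension lda.  piv[k] records the row swapped with row k.
 * A sketch of the kind of reference algorithm behind the serial kernels;
 * not the provided implementation. */
static void lu_ref(int n, double *A, int lda, int *piv) {
    for (int k = 0; k < n; k++) {
        /* index of absolute maximum in column k (the reduction kernel) */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i + k*lda]) > fabs(A[p + k*lda])) p = i;
        piv[k] = p;
        /* row swap */
        if (p != k)
            for (int j = 0; j < n; j++) {
                double t = A[k + j*lda];
                A[k + j*lda] = A[p + j*lda];
                A[p + j*lda] = t;
            }
        /* scale the pivot column */
        for (int i = k + 1; i < n; i++) A[i + k*lda] /= A[k + k*lda];
        /* rank-1 update of the trailing sub-matrix */
        for (int j = k + 1; j < n; j++)
            for (int i = k + 1; i < n; i++)
                A[i + j*lda] -= A[i + k*lda] * A[k + j*lda];
    }
}
```

A correct version such as this, run on small matrices, is also a handy oracle when debugging your own kernels.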

Other values of v select GPU versions, kept in linsolve.cu. -v 1 calls dgetrfMM(), where only the matrix multiply part of the factorization is done on the GPU. It copies the trailing sub-matrix over to the device, instructs the GPU to perform the multiply, and then copies the trailing sub-matrix back to the host. Obviously, data transfer time will be a limiting factor for this algorithm!
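To see why, compare bytes moved with flops done. A back-of-envelope model (the bandwidth and flop-rate figures in the test are assumed placeholders, not measured Xe values):

```c
#include <assert.h>

/* Cost model for dgetrfMM()-style offload: every outer iteration ships the
 * trailing m x m sub-matrix to the device and back (2*m*m*8 bytes), while
 * the multiply there does about 2*m*m*nb flops.  Summed over the whole
 * factorization, bytes moved ~= 16*n^3/(3*nb) versus ~2*n^3/3 multiply
 * flops, i.e. the bus carries about 8/nb bytes per flop. */
static double transfer_bytes(double n, double nb) { return 16.0*n*n*n / (3.0*nb); }
static double multiply_flops(double n)            { return 2.0*n*n*n / 3.0; }

/* ratio of transfer time to compute time for assumed hardware rates */
static double transfer_over_compute(double n, double nb,
                                    double bus_bytes_per_s, double gpu_flops_per_s) {
    return (transfer_bytes(n, nb) / bus_bytes_per_s) /
           (multiply_flops(n)     / gpu_flops_per_s);
}
```

For example, with an assumed 5 GB/s bus and 100 Gflop/s multiply rate at NB=32, the model predicts transfers taking about five times as long as the compute - so the bus, not the GPU, sets the pace.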

-v 2 calls dgetrfUMM(), where the update of the upper panel and the matrix multiply are performed on the GPU. It tries to be a bit smarter with respect to data transfer. The whole matrix is copied at the start. At each outer iteration, the lower panel is copied back from the GPU, and is factored on the host. It is then copied back to the GPU. Row swaps to the left are performed on the host, whereas those on the right are performed on the GPU. After the upper panel is completed on the GPU (if the compile switch USE_HOST_TRSM is defined, it will be done on the host instead), it is copied back to the host. Finally the matrix multiply on the trailing sub-matrix is performed on the GPU.

-v 3 calls dgetrfAll(), which performs all of the computation on the GPU: the matrix is copied to the GPU at the start, and is copied back at the end.

-v 4 calls dgetrfWild(), an optional routine that is yet to be implemented.

By default, these routines call the CUDA BLAS to perform all jobs on the GPU. If the -k option is used, they will instead call the CUDA kernels in linkernels.cu on the GPU. Currently, the kernels implemented are the matrix multiply algorithms mentioned in Lecture 7 and serial (single thread) kernels for the other parts of Linpack. It is your task to design and write more efficient kernels in linkernels.cu and add in the code to invoke them!

The general usage for the matrix multiply test program is:

    runXe ./matmultest [-p] [-v v] [-w W] [-l ld] M N K
This program is a rigorous test harness for your matrix multiply kernels. It simulates the linpack memory layout and also checks for buffer overwrites (note: a buffer overwrite should be regarded as an error - even if the result matrix passes the correctness test). It is provided because it will be much easier to debug your matrix multiply kernel from this program than from within the Linpack benchmark. Note that setting N = M+1 and K = NB will simulate the matrix dimensions used in the Linpack calculation.

-v 0 selects the parallel but otherwise unoptimized matMult_0() kernel in linauxkernels.cu. -v 1 can be used to select the matMult_1() kernel with the shared memory optimization. -v 2 is similar, but it uses the code in launchMatMult1K() to launch the kernel. -v 3 can be used to test your version of this routine launchMatMultK() (together with your improved matrix multiply kernel) in linkernels.cu. -v 4 can be used to test your rank-1 update kernel and launch code, through the routine launchRank1UpdateK().

You should study the general structure and organization of these files, and in particular linauxkernels.cu, which has examples of GPU kernel definitions and launching. These examples should demonstrate enough CUDA features for your needs for this assignment - if not, refer to the external CUDA documentation.

Your Tasks

In all of these tasks, you are expected to record your findings and comment on any conclusions in your report. For later reference, it is recommended that for each series of measurements you cut-and-paste a representative example of the command that you used, e.g. runXe ./linpack -v 1 -b 48 8192 for one of the measurements in Q1.
  1. Find the (roughly) optimum values of NB for dgetrfBL() (note: run this version without runXe, i.e. just ./linpack -v 0 -b NB 4096), dgetrfMM() (-v 1), dgetrfUMM() (-v 2) and dgetrfAll() (-v 3). Also try the kernel version of the last (-v 3 -k). Do this for both N=4096 and N=8192 (omit -v 1). It is suggested that you try multiples of 8 or 16, until performance seems to `level out' (do not go beyond 64 unless you think you see something interesting). Tabulate your results. Calculate the speedups for the optimum value in each case. Comment on the trends in both performance and the optimal values of NB.

  2. Devise experiments to determine the overhead of a CUDA kernel launch, broken down into the cost for a single thread, the additional per-thread overhead within a single block, and the per-block overhead with multiple blocks. Record the results and comment. Hint: use runXe matmultest M N 1 for suitable values of M and N; take the best of 5 measurements in each case.
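    The analysis step can be sketched as below; the timing figures in the test are made-up numbers to exercise the arithmetic, not measurements, and the decomposition assumes a simple linear cost model:

```c
#include <assert.h>

/* Given best-of-5 times (seconds) for:
 *   t1 = 1 thread in 1 block,
 *   tB = one full block of T threads,
 *   tG = G blocks of T threads each,
 * a linear cost model separates the launch overheads. */
static double best_of(const double *t, int n) {
    double m = t[0];
    for (int i = 1; i < n; i++) if (t[i] < m) m = t[i];
    return m;
}
/* extra cost of each added thread within a block */
static double per_thread_overhead(double t1, double tB, int T) {
    return (tB - t1) / (T - 1);
}
/* extra cost of each added block */
static double per_block_overhead(double tB, double tG, int G) {
    return (tG - tB) / (G - 1);
}
```

Taking the minimum of repeated runs filters out timing noise from the operating system and the first-use cost of the CUDA context.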

  3. Implement an optimized matrix multiply kernel, which is launched from launchMatMultK(). It should be optimized for the context of the Linpack benchmark runs in Q1, and be significantly faster than the provided kernel (matMult_1()). Find the optimal NB as for Q1 (use -v 3 -k), and also give results for the corresponding sizes in matmultest (use matmultest -v 3 N N NB). In your report, describe and justify your optimization strategy.
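    The tiling structure such a kernel relies on can be mocked up on the host. In the sketch below (our own illustration, assuming column-major storage as in linpack; not the provided kernel), each (it, jt, kt) tile iteration corresponds to one thread block staging a W x W tile of A and of B into __shared__ memory:

```c
#include <assert.h>

#define W 16  /* tile width; plays the role of the -w thread-block size */

/* C += A*B with A m x k, B k x n, all column-major with leading dimension
 * ld.  The three outer tile loops mirror a shared-memory CUDA kernel:
 * each kt step corresponds to one staged tile load followed by a partial
 * accumulation.  Host-side sketch for checking results only. */
static void matmul_tiled(int m, int n, int k, const double *A,
                         const double *B, double *C, int ld) {
    for (int jt = 0; jt < n; jt += W)
        for (int it = 0; it < m; it += W)
            for (int kt = 0; kt < k; kt += W)       /* staged "tile loads" */
                for (int j = jt; j < n && j < jt + W; j++)
                    for (int i = it; i < m && i < it + W; i++) {
                        double s = 0.0;
                        for (int kk = kt; kk < k && kk < kt + W; kk++)
                            s += A[i + kk*ld] * B[kk + j*ld];
                        C[i + j*ld] += s;
                    }
}
```

The bounds tests (j < n, i < m, kk < k) are exactly the partial-block handling discussed in the note below: a GPU kernel needs the equivalent guards, or a separate cleanup kernel.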

    Note: it is desirable that your code can handle the case where the matrix size is not a multiple of thread block sizes. However, if this proves to be a problem, it can be avoided by using another kernel to `cleanup' the trailing thread blocks: see launchMatMult1K() for example code.

    Also, observe the number of registers used by your kernel from the compiler output. Using this and the size of your shared data arrays, calculate the CUDA occupancy and compare this with your results from the matmultest program. Experimentally, a naive estimate of occupancy can be obtained by dividing the time of a single-thread calculation M = N = 1 (minus the kernel launch overhead, see Q2) by the time for M = N = K divided by N^2 (use a largish K here, say 1024). Why does this give an inflated value for occupancy?
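    The static part of that calculation can be mechanized. The per-SM limits below are the standard Fermi (compute capability 2.0) figures; the register and shared-memory numbers in the test are invented examples - substitute what the compiler reports for your kernel. Register allocation granularity is ignored, so treat the result as an upper bound:

```c
#include <assert.h>

/* Static occupancy for a Fermi (compute 2.0) SM.  Per-SM limits:
 * 32768 32-bit registers, 49152 bytes of shared memory, 1536 resident
 * threads, 8 resident blocks.  Returns resident threads / 1536.
 * Simplification: register allocation granularity is ignored. */
static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}
static double occupancy(int threads_per_block, int regs_per_thread,
                        int smem_per_block) {
    int by_regs    = 32768 / (regs_per_thread * threads_per_block);
    int by_smem    = 49152 / smem_per_block;
    int by_threads = 1536  / threads_per_block;
    int blocks = min3(by_regs, by_smem, by_threads);
    if (blocks > 8) blocks = 8;                     /* resident-block cap */
    return (double)(blocks * threads_per_block) / 1536.0;
}
```

For instance, a hypothetical 16 x 16 block (256 threads) using 20 registers per thread and 4 KB of shared memory is thread-limited at 6 blocks per SM, giving full occupancy; doubling the register count halves it.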

  4. Implement parallel versions of the kernels for rank-1 update, triangular matrix solve, row swap, scale vector and finally index of absolute maximum. Note that the latter is a reduction; you are recommended to adapt the algorithm in Lecture 7. Discuss and justify your parallelization strategy in each case.
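    The reduction can be prototyped on the host before committing it to a kernel. The sketch below (our own illustration; n assumed a power of two for clarity) simulates the shared-memory stages in plain C: val[] and idx[] play the role of the __shared__ arrays, and the loop over tid plays the role of the threads active at each stride:

```c
#include <assert.h>
#include <math.h>

/* Tree reduction for index of absolute maximum over n values (n <= 1024,
 * a power of two).  Each stride halves the number of active "threads";
 * a __syncthreads() would separate the stages on the GPU.  On equal
 * values the lower index wins, matching a sequential search. */
static int iamax_tree(int n, const double *x) {
    double val[1024];
    int idx[1024];
    for (int i = 0; i < n; i++) { val[i] = fabs(x[i]); idx[i] = i; }
    for (int stride = n / 2; stride > 0; stride /= 2)
        for (int tid = 0; tid < stride; tid++)
            if (val[tid + stride] > val[tid] ||
                (val[tid + stride] == val[tid] && idx[tid + stride] < idx[tid])) {
                val[tid] = val[tid + stride];
                idx[tid] = idx[tid + stride];
            }
    return idx[0];
}
```

The tie-breaking rule matters here: as the Hints section notes, a pivot index that merely differs from the serial one can silently degrade accuracy rather than fail outright.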

    Note: you need to change the #if SERIAL==1 lines to #if 0 and replace the assert(TODO) calls in linkernels.cu with your kernel launch code. If you can't get a kernel working correctly, restore the line to #if SERIAL==1 and go on to the next kernel.

    When completed, repeat the measurements of Q1 with -v 3 -k and comment on the effectiveness of your kernels.

  5. A GPU programmer called Zed believed that, in the context of Linpack, execution time was dominated by the launching of kernels, at least for moderate N. He then proposed to write the whole of dgetrf() as a single kernel!

    Critically evaluate this idea, from the point of view of feasibility in programming and potential for performance. Calculate (count) the number of kernel invocations in dgetrfAll() as a function of N (you may ignore terms of the order of N/NB). Using your results from Q2 on the overhead of launching a kernel, calculate the percentage of overall execution time spent launching kernels for N=1000 and N=4000. To what extent does this support Zed's belief?
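    The shape of the estimate is sketched below; k_col and k_iter are PLACEHOLDERS for the per-panel-column and per-outer-iteration launch counts, which you must determine yourself from linsolve.cu and linkernels.cu:

```c
#include <assert.h>

/* Skeleton of the Q5 estimate.  With blocking factor nb there are n/nb
 * outer iterations, each launching some k_iter kernels (swaps, solve,
 * multiply, ...), and each of the n panel columns launches some k_col
 * kernels.  Both counts are placeholders, not values from the code. */
static double launches(double n, double nb, double k_iter, double k_col) {
    return k_col * n + k_iter * n / nb;
}

/* fraction of the run spent launching, given per-launch overhead and
 * total run time (both in seconds) */
static double launch_fraction(double nlaunch, double t_launch, double t_total) {
    return nlaunch * t_launch / t_total;
}
```

Note the leading term is proportional to N, not N/NB: under this model the panel columns, not the outer iterations, dominate the launch count.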

  6. (optional, for bonus marks) Profile your kernels (-v 3 -k) using the Compute Visual Profiler, as described in the NF wiki. Use runXe -profile linpack ... to do this (further instructions will be put here later). This will create a log file cvp_output_0.log giving information on each kernel call, and a summary file cvp_output_0.summ containing totals or averages (as appropriate) for each kernel. Use both values of N as for Q1. Discuss the results, i.e. what is the likely bottleneck in your matrix multiply kernel, and which of the other kernels is having the most impact on execution time?

  7. (optional, for bonus marks) The kernel matMult_1() has a curious flaw: for matrix sizes that are not exact multiples of thread blocks, it can cause the Linpack benchmark to fail. If you comment out the definition of AVOID_PARTIAL_BLOCKS in linkernels.cu (and restore the call to launchMatMult1K() in linkernels.cu), the bug can be exposed by the command:
      runXe linpack -v 1 -k 1000
    Note that the problem is not exposed if you use -v 2 or -v 3 instead. Find an explanation for this strange behavior (if possible, verify it using an experiment), or at least show how the code can be fixed to avoid this.

  8. (optional, for bonus marks) The function dgetrfWild() may be coded if you have ideas on improving Linpack Benchmark performance in ways not covered so far (e.g. using different memory layouts). As above, explain and justify your ideas, and compare the results with the appropriate version of the Linpack Benchmark so far. If necessary, add any extra files in ass2.tar.

  9. (optional, for bonus marks) If you can discover any more interesting experiments in the Linpack Benchmark on the GPU, please describe them in your report for bonus marks. If necessary, add any extra files in ass2.tar. This should contain a README file describing the other contents and, if needed, how to compile and run these.

  10. Your report should also give overall conclusions on your work, e.g. commenting on the overall effectiveness of the compilers and architecture. Questions such as `what are the secrets of the CUDA BLAS?' could be addressed here. You could also comment on the cost-effectiveness of programming GPUs: how much speedup for how much effort? In particular, note that the CUDA BLAS code path in dgetrfAll() is little more than a `port' of the host BLAS version of dgetrfBL(). Even though you were not asked to do this, comment on how much effort you think this would have taken, balanced against the performance attained.
Your report may be generated in your favorite format (this may include plain text!) but must be converted to standard PDF format before submission. The CUDA file(s) should contain your final (best) versions of your code, although interesting intermediate versions of code sections could be left there commented out (preferably with #ifdefs) and/or placed in your report.

Marking Guidelines

Your code must preserve correctness to obtain any marks for the associated task. Code tidiness, clarity, elegance and parallelization will be expected. Most marks will be allocated directly to the report. Quality of writing (grammar, sentence structure, organization) and clarity of explanations will be expected. Presentation of data is important. Don't include data/program output that is not relevant to the question being answered; don't present numbers with a meaningless number of digits.

In terms of content, the bottom line is to show good and interesting insights into performance issues for a numerically-intensive computation on this highly multicore and multithreaded co-processor. Little or no marks will be given for pasting voluminous and uninterpreted program output.

In your report, you need not go to a lot of trouble with typesetting tables and figures. In fact, plain text format will be quite adequate. However, your report should be tidy and presentable (including free from spelling and grammatical errors).

Hints

Debugging such numerical computations is hard! From a GPU, even harder! There is no way to perform a printf() statement from within a GPU kernel on the Xe. It is a matter of looking at the results (-p option) to infer the cause of a bug. Obviously, use matmultest to debug your matrix multiply and rank-1 update kernels.

You can structure your kernels (__global__) to call functions that can be run on either the GPU or the host (__host__ __device__). These can be called from a normal (test) program (you will have to write it!) and hence debugged on the host.
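The usual trick is a qualifier macro, so the same source compiles under nvcc (callable from kernels) and under a plain host compiler (debuggable with ordinary tools). A sketch - the HD macro name and the worker function are our invention, not something in the provided files:

```c
#include <assert.h>

/* Under nvcc, __CUDACC__ is defined and the function becomes callable
 * from both host and device; under a plain C compiler the qualifiers
 * vanish and it is an ordinary function you can unit-test. */
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

/* example worker: one element of a rank-1 update, shareable between a
 * kernel and a host test program (hypothetical helper) */
HD double rank1_elem(double aij, double li, double uj) {
    return aij - li * uj;
}
```

A small host-only main() exercising such workers on known inputs then gives you printf-level visibility that the GPU on the Xe cannot.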

For the other kernels, it is suggested that you modify them one by one. Test your program after every change; if the residual check fails, you will know that it is due to what you last did! For small N, printing out the matrix at the various stages of the computation, and comparing this to the same stage in a correct version, may be helpful.

Note that small bugs in the index of absolute maximum kernel may not necessarily cause the residual check to fail (if it is returning a random result, it probably will!): if it returns an index reasonably close to the maximum, it will cause a loss of accuracy. Hence the residual may increase but still be within the threshold. You can test correctness by comparing the values of the pivot vector (P) of say runXe linpack -v 3 -k -p 20 with that of the same command but without -k: they should be the same.

Last modified: 7/10/2011, 17:40
