The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science

COMP8320 Multicore Computing Assignment 1, 2011

OpenMP LINPACK Benchmark on the UltraSPARC T2


Deadline: 17:00 on Tuesday 06 September (amended from Friday 02 September)


(Please report any errors, omissions and ambiguities to the course lecturer)
(amendments to this document after its release will be marked in blue).


This assignment is worth 20% of your total course mark (amended from 15%). It will be marked out of 60.

Submission

This assignment must be submitted electronically. This can be done using the submit comp8320 ass1 ... command (from the student system, not wallaman). The file names which you should use (and the only ones that will be accepted by the submit command) are:

    report.pdf
    linsolve.c
    matmult.c
    placethreads.c
and optionally:
    ass1.tar

Extensions

See http://cs.anu.edu.au/student/comp8320/assessment.php. Late penalties are as follows:

    How Late                            Penalty (from 60 marks)
    less than 1 hour                    -1
    between 1 and 6 hours               -2
    between 6 and 24 hours (1 day)      -4
    between 24 and 48 hours (2 days)    -8
    between 48 and 72 hours (3 days)    -16
    between 72 and 96 hours (4 days)    -32
    more than 96 hours                  -64 (forget it!)

Note: late penalties apply even if you submit just one file after the deadline.

Plagiarism

This is an individual assignment. Any significant assistance received from anyone except the course staff must be acknowledged at the top of report.pdf. See http://cs.anu.edu.au/student/comp8320/assessment.php.

Background

The LINPACK Benchmark is arguably (contentiously!) the most important benchmark in High Performance Computing. It solves an N x N linear system A x = y for x, by first factoring A using blocked LU decomposition as A = P L U, where P is a row permutation matrix, L is a lower triangular matrix with a unit diagonal and U is an upper triangular matrix. It then finds the solution using x = A' y, where A' = U' L' P' (the notation A' means the inverse of the matrix A). Usually, the factorization is performed in-place, i.e. the storage for A is overwritten, with the elements of L replacing the elements in the lower triangular half of A (the diagonal elements of L are 1.0 and are not explicitly stored), and similarly with the elements of U replacing the upper half.

An optimization commonly performed is to augment the matrix with the right-hand-side vector, giving (A|y), and to factorize this N x (N+1) matrix, leaving (L\U|z) in its place (that is, the lower half of A has been overwritten with L, the upper half has been overwritten by U, and y has been overwritten by z, where z = L' P' y). The solution stage then simply involves x = U' z.

An unblocked version of LU decomposition (using Gaussian elimination) is given below, in pseudo-C notation for an M x N matrix A :

    for (i=0; i < N; i++) {
      p = i + index_largest_element(A(i:M-1, i));
      swap A(i, :) and A(p, :);
      A(i+1:M-1, i) /= A(i,i);  /* divide all elements in the vector on the LHS by A(i,i) */
      A(i+1:M-1, i+1:N-1) -= A(i+1:M-1, i) * A(i, i+1:N-1);
           /* subtract from the sub-matrix on the LHS the vector outer-product on the RHS */
    }
Here we use `array section notation': A(i1:i2, j1:j2) denotes the (i2-i1+1) x (j2-j1+1) sub-matrix of A, starting from position (i1, j1); j1 can be omitted if it is the lower bound (0), and j2 can be omitted if it is the upper bound (N-1); a bare `:' (as in A(i, :)) denotes the full index range.
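
A minimal runnable C rendering of this pseudo-code is sketched below. It is illustrative only (it is not part of the provided code): it assumes the column-major storage and 1-based pivot convention described under Setup, and the name lu_unblocked() and the A(i,j) macro are invented for this sketch.

    #include <math.h>

    #define A(i,j) a[(long)(j)*ldA + (i)]        /* column-major element (i,j) */

    /* Unblocked, partially pivoted LU of the M x N matrix a (leading dimension
       ldA).  On return the strictly lower triangle holds L (unit diagonal
       implied) and the upper triangle holds U; piv[i] = p+1 records that rows
       i and p were interchanged. */
    void lu_unblocked(int M, int N, double *a, int ldA, int *piv)
    {
      int nsteps = (M < N) ? M : N;
      for (int i = 0; i < nsteps; i++) {
        int p = i;                               /* row of largest |A(k,i)|, k >= i */
        for (int k = i+1; k < M; k++)
          if (fabs(A(k,i)) > fabs(A(p,i))) p = k;
        piv[i] = p + 1;                          /* 1-based, as in the provided code */
        for (int j = 0; j < N; j++) {            /* swap rows i and p */
          double t = A(i,j); A(i,j) = A(p,j); A(p,j) = t;
        }
        for (int k = i+1; k < M; k++)            /* scale the pivot column */
          A(k,i) /= A(i,i);
        for (int j = i+1; j < N; j++)            /* rank-1 trailing update */
          for (int k = i+1; k < M; k++)
            A(k,j) -= A(k,i) * A(i,j);
      }
    }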

We can give similar pseudo-code for the blocked algorithm (a sketch is given below); note that the first nested i loop effectively forms an unblocked LU decomposition of the rectangular lower panel A(j:N-1, j:j+w-1), where w is the panel width; the lower triangular parts of these panels, taken over all blocks, form the lower factor L, while the upper triangular factor U is made up of the blocks on and above the diagonal.
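
A sketch of one common way to structure the blocked algorithm is given below (again illustrative only, not the course's original pseudo-code; it reuses the A(i,j) macro and lu_unblocked() from the previous sketch). Each iteration factors a panel of width NB with the unblocked algorithm, applies that panel's row swaps to the columns outside the panel, and then computes the block row of U and the trailing sub-matrix update; in a BLAS-based version the last two phases become dtrsm() and dgemm() calls.

    /* Blocked LU sketch (illustrative only); NB is the panel width. */
    void lu_blocked(int N, int NB, double *a, int ldA, int *piv)
    {
      for (int j = 0; j < N; j += NB) {
        int w = (N - j < NB) ? (N - j) : NB;

        /* unblocked LU of the lower panel A(j:N-1, j:j+w-1) */
        lu_unblocked(N - j, w, &A(j,j), ldA, &piv[j]);
        for (int k = j; k < j + w; k++)
          piv[k] += j;                           /* make pivot row indices global */

        /* apply the panel's row swaps to the columns outside the panel */
        for (int k = j; k < j + w; k++) {
          int p = piv[k] - 1;
          if (p != k)
            for (int c = 0; c < N; c++)
              if (c < j || c >= j + w) {
                double t = A(k,c); A(k,c) = A(p,c); A(p,c) = t;
              }
        }

        /* block row of U: A(j:j+w-1, j+w:N-1) := L11^{-1} * A(j:j+w-1, j+w:N-1)
           (the dtrsm() step in a BLAS-based version) */
        for (int c = j + w; c < N; c++)
          for (int k = j; k < j + w; k++)
            for (int i = k + 1; i < j + w; i++)
              A(i,c) -= A(i,k) * A(k,c);

        /* trailing update: A(j+w:N-1, j+w:N-1) -= L21 * U12
           (the dgemm() step, which dominates for large N) */
        for (int c = j + w; c < N; c++)
          for (int k = j; k < j + w; k++)
            for (int i = j + w; i < N; i++)
              A(i,c) -= A(i,k) * A(k,c);
      }
    }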

The blocked algorithm permits the computation to be expressed in terms of (sub-) matrix-matrix operations, which in turn greatly reduces the number of load and store instructions and improves cache performance. The block size is commonly called NB. A value of NB = N will force the blocked algorithm to behave as an unblocked algorithm; a value of NB = 1 has a similar effect. The optimal value of NB will be a moderate-sized number, dependent on the memory system characteristics of the machine it will run on; 16 ≤ NB ≤ 80 are typical values.

To support the development of such algorithms, the BLAS (Basic Linear Algebra Subprograms) library was introduced. This contains generalized vector-vector, vector-matrix and matrix-matrix operations, with highly optimized versions being available on most machines. The generalized matrix-matrix multiplication routine dgemm() dominates BLAS-based blocked LU decomposition performance for large N.

Setup

Copy the files (from wallaman) for the assignment into your home directory area:
    cp -r /dept/dcs/comp8320/public/ass1/ .

The provided program linpack.c, when compiled (via the command make) and executed as linpack N, produces a randomly generated N x N matrix A with values in the range -1 to +1. It initializes the solution vector x0 similarly, then computes the corresponding input vector using y = Ax0. It then runs the benchmark by calling dgetrf() to factorize (A|y), overwriting it with (L\U|z), and then finds the resulting solution using dtrsv() which computes x = U' z. The permutation matrix P is represented by a pivot vector P[], where P[i] records the index of the row which was interchanged with row i during the factorization.

x is compared with x0 via a normalized residual calculation; a value less than one means that the result is both correct and accurate, a value slightly greater than one means the result is inaccurate, and a value much larger than this indicates an error (i.e. a bug in the Linpack benchmark code used). The benchmark is deemed to have failed if the residual exceeds one.

As the program calls standard routines such as dgetrf(), which follow Fortran conventions for arrays (column-major storage and indices beginning from 1), the program creates A in column-major storage. That is, logical elements A(i,j) and A(i+1,j) are in consecutive double words, whereas A(i,j) and A(i,j+1) are separated by ldA ≥ N double words. The code is set up so that logical element A(i,j) may be accessed as either A[j][i] or A[0]+i+j*ldA. For the same reason, returned quantities referring to array indices are adjusted as if array indices started from 1 (hence a value p+1 in P[i] signifies that rows i and p were interchanged).
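
For illustration, a minimal sketch of one way such a column-major matrix could be set up (this is an assumption about the layout, not the actual allocation code in linpack.c):

    #include <stdlib.h>

    /* Allocate an N x N column-major matrix with leading dimension ldA so that
       element (i,j) can be read either as A[j][i] or as A[0][i + j*ldA]. */
    double **alloc_colmajor(int N, int ldA)
    {
      double *buf = malloc((size_t)ldA * N * sizeof(double));  /* contiguous storage */
      double **A  = malloc((size_t)N * sizeof(double *));
      for (int j = 0; j < N; j++)
        A[j] = buf + (size_t)j * ldA;            /* A[j] points at column j */
      return A;
    }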

The provided Makefile compiles the codes with /opt/SUNWspro/prod/bin/cc rather than the OpenMP 3.0 compliant compiler in /opt/sunstudio12.1/bin/. This is because the former gives better performance for this computation.

The general usage for the program is:

    ./linpack [-p] [-b NB] [-v v] [-c c] [-a info] N [r [dN]]
(just executing the command ./linpack will give a full description of these parameters). The -p option can be used for printing the matrices and vectors; this can be useful for debugging. The -a info option can be used to pass auxiliary information via the string info down to the matrix factorization routines. The -c c option can be used for placing threads on cores (see below).

The -v v option is used to select various versions of the benchmark. v=0 will perform the computation by the optimized dgetrf() of the Sun Performance Library. Other versions can be tuned by the blocking factor NB. v=1 will select dgetrfBL() in linaux.h; this uses the (optimized) BLAS routines in a fashion corresponding to the blocked algorithm mentioned above. Both of these versions will parallelize if $OMP_NUM_THREADS is set to a value greater than 1.

v=2 will select the function dgetrfC(); this is a version of dgetrfBL() with BLAS calls translated into C loops. v=3 will select the function dgetrfCN(); this is (initially) similar to dgetrfC(), except that you will later parallelize this using nested parallelism. v=4 will select the function dgetrfBLP() and v=5 will select the function dgetrfWild(). These are all in linsolve.c; it is up to you to work on these!

dgetrfC() calls a matrix multiply routine matMult(); this is kept in a separate file matmult.c so it can be analyzed (and if need be compiled) separately.

Finally, note that the UltraSPARC T2 was not designed for floating-point intensive computations and it may prove difficult to approach the goal of peak performance (6 cores * 1.2 GHz = 7.2 GFLOPs in the case of wallaman). However, the assignment should give you experience in designing, coding and analyzing OpenMP applications on a highly multicore multithreaded processor.

Your Tasks

In all of these tasks, you are expected to record your findings and comment on any conclusions in your report. It is recommended that for each series of measurements, you cut-and-paste a representative example of the command that you used, e.g. export OMP_NUM_THREADS=16; ./linpack -v 1 -b 8 4000 for the 2nd series of measurements in Q1.
  1. Find the optimum values for NB for dgetrfBL() at N=1000 (1 thread) and N=4000 (16 threads). It is suggested that you try multiples of 8, until performance clearly begins to decrease. Tabulate your results. In general, what would you expect to be an optimal value? Hint: assume that the SunStudio BLAS operates by dividing the matrix multiply up so that input matrices are in square chunks.

  2. Compare the speed and scalability of dgetrf() and dgetrfBL() for both (i) 1, 2, 4, 8 threads at N=1000 and (ii) 8, 16, 24, 32, 48 threads at N=4000. For the latter, use the optimal NB from the previous question. Again, tabulate your results, including the speedup ratios (based on the speed at the smallest number of threads).

  3. Before performing any modifications, determine the performance of dgetrfC() (with only 1 thread) and find the optimal blocking factor. You will probably notice that it is significantly slower. Execute the command make matmult.s and inspect the file. Locate the innermost loop (you will notice that the compiler has unrolled this loop). Insert this code into your report, and perform a `cycle count' of the loop (in tutorials, you will be given some instructions on how to do cycle counts. For references on the SPARC instruction set, see COMP6300 lecture M4 and Laboratory 09). Does this explain single-thread performance? Briefly explain. Furthermore, does the degree of unrolling have any effect on the choice of optimal NB?

  4. Parallelize the function dgetrfC() (including matMult()) using OpenMP directives, using only a single level of parallelism. Experiment with different possibilities (including parallelizing inner vs outer loops). Consider loop re-ordering where possible. Try to obtain the best parallelization strategies, and describe and justify them (e.g. why they are valid and why they should be efficient on the T2) in your report. Also answer the following question: is there any scope for the OpenMP sections directive; if so, would it be useful to use it?

    Measure the speed and scalability of dgetrfC() for both (i) 1, 2, 4, 8 threads for N=1000 and (ii) 8, 16, 24, 32, 48 threads for N=4000. Compare your results with dgetrfBL(). What conclusions can you draw?

  5. Parallelize the function dgetrfCN() using two levels of nested parallelism in at least one parallel region (other parallel regions should use the same single-level parallelism strategy that you used for the previous question); a minimal illustrative sketch of nested OpenMP regions is given after this task list. In the nested region(s), the total number of threads should be $OMP_NUM_THREADS, and, provided $OMP_NUM_THREADS is not prime, the number of threads at each level should always exceed 1. Again, try to obtain the best strategy for nested parallelism, and describe and justify it in your report.

    To test that the nesting is working as expected, when the -a p option is used, a single thread in each nested team should print its team id (the thread id of its `ancestor') and the number of threads in its team. This must only occur the first time the (first) nested region is entered (for full marks, you should ensure the messages don't get repeated if dgetrfCN() gets called a second time), e.g.

    u8914893@wallaman:~> export OMP_NUM_THREADS=4; ./linpack -v 3 -a p 1000
    u8914893@wallaman:~> export OMP_NUM_THREADS=16; ./linpack -v 3 -a p 1000

    compute the Linpack benchmark, for N=1000 with NB=32 (alg version 3) using 16 threads
    Nested team 2: 4 threads
    Nested team 1: 4 threads
    Nested team 0: 4 threads
    Nested team 3: 4 threads
    Linpack benchmark with N=1000 in 2.09689s @ 318 MFLOPS
    PASSed residual check: 6.39e-03
    u8914893@wallaman:~>
    
    Take some appropriate measurements and compare the performance with that of dgetrfC(), and discuss in your report.

  6. Implement OMP thread placement code in placethreads.c. In a parallel region, get each thread to bind itself to the required virtual CPU id (see placethreads.h) using the processor_bind() system call (you will also probably need each OpenMP thread to pass the Solaris thread id (slightly different to the OpenMP thread id!) obtained from the thr_self() system call). Consult the Solaris man pages for both of these; also see the thread placement code in test_barriers.c from Lab 3; an illustrative sketch of these calls is also given after this task list. If the call fails, the thread should call perror() with a message stating which thread could not bind to which CPU id; it should ensure this message appears cleanly when 2 or more threads encounter errors. It should then call processor_bind() to confirm the bindings and list them in order, as in:
      u8914893@wallaman:~/comp8320/ass1> ./linpack -c 2 1000
      compute the Linpack benchmark, for N=1000 with NB=32 (alg version 0) using 16 threads
      OMP threads 0 to 15 are bound to cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      Linpack benchmark with ...
      u8914893@wallaman:~/comp8320/ass1> ./linpack -c 8 1000
      compute the Linpack benchmark, for N=1000 with NB=32 (alg version 0) using 16 threads
      thread 15 could not bind to CPU id 60: Invalid argument
      thread 14 could not bind to CPU id 56: Invalid argument
      OMP threads 0 to 15 are bound to cpus: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 -1 -1
      Linpack benchmark with ...
      

    Note that virtual CPU ids 0 to 3 are in the first (hardware thread) group of core 0, and ids 4 to 7 are in its second group. With both 8 and 16 threads, try using thread placement for 1, 2, 4, and 6 cores. Explain the results.

  7. In dgetrfBLP(), begin with the code of dgetrfBL(), but explicitly parallelize the (important) BLAS calls using a parallel region; each thread executes a `segment' (or possibly several) of the original BLAS operation. In the case of dgemm(), consider making segments in either or both dimensions of the 3rd matrix operand (the result matrix). As above, explain and justify your strategy, and compare the results with dgetrfBL(). For details on the BLAS API, see man dgemm etc.

  8. You will probably notice a difference in speeds in the various versions. Use the Solaris cputrack and collect tools to analyse and explain the performance differences between benchmark versions 0 to 3. For example, is one version slower because it has more TLB or level 2 cache misses, or does it simply execute more instructions, or does it have more cycles when all threads are stalled? Hint: use cputrack -T 100 -c .... to get the overall counts (as profiling-based methods aren't as accurate when the counts become small). Then, when a significant difference is found, use collect to see in which functions the events are primarily occurring. Tip: in the Functions tab, right-clicking on a column header, e.g. Inclusive OMP Wait Time, will sort the functions by that count.

  9. (optional, for bonus marks) The function dgetrfWild() may be coded if you have ideas on improving Linpack Benchmark performance in ways not covered so far. As above, explain and justify your ideas, and compare the results with the appropriate version of the Linpack Benchmark so far. If necessary, add any extra files in ass1.tar.

  10. (optional, for bonus marks) If you can devise any further interesting experiments with the Linpack Benchmark on the UltraSPARC T2 (e.g. exploring OpenMP 3.0 features), or on another highly multicore platform at your disposal, please describe them in your report for bonus marks. If necessary, add any extra files in ass1.tar. This should contain a README file describing the other contents and, if needed, how to compile and run these.

  11. Your report should also give overall conclusions on your work, e.g. commenting on the overall effectiveness of the compilers, hardware threading and multicore architecture. Questions such as `what are the secrets of dgetrf?' could be addressed here.
Your report may be generated in your favorite format (may include plain text!) but must be converted to standard PDF format before submission. The C files should contain your final (best) versions of your code, although interesting intermediate versions of code sections could be left there commented out (preferably with #ifdefs) and/or placed in your report.
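
For Task 5, the following is a minimal sketch of how nested OpenMP parallel regions can be created, and of one way a single thread per inner team might report its team id and team size; it is illustrative only (it is not the required dgetrfCN() code, and the 4 x 2 thread split is just an example).

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
      omp_set_nested(1);                         /* allow nested parallel regions */
      int outer = 4, inner = 2;                  /* e.g. 4 x 2 = 8 = OMP_NUM_THREADS */

      #pragma omp parallel num_threads(outer)
      {
        int team = omp_get_thread_num();         /* the `ancestor' thread id */
        #pragma omp parallel num_threads(inner)
        {
          #pragma omp master                     /* one thread per nested team */
          printf("Nested team %d: %d threads\n", team, omp_get_num_threads());
        }
      }
      return 0;
    }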
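
For Task 6, the following is a minimal sketch of the Solaris binding call itself (illustrative only, not a complete placethreads.c); cpu_of[] is a hypothetical mapping from OpenMP thread id to the desired virtual CPU id, in the spirit of placethreads.h.

    #include <omp.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <thread.h>

    /* Bind each OpenMP thread to the virtual CPU id cpu_of[its thread number]. */
    void place_threads(const int cpu_of[])
    {
      #pragma omp parallel
      {
        int tid = omp_get_thread_num();
        /* processor_bind() identifies this LWP by its Solaris thread id, which
           thr_self() returns (and which differs from the OpenMP thread id) */
        if (processor_bind(P_LWPID, thr_self(), cpu_of[tid], NULL) != 0) {
          char msg[64];
          snprintf(msg, sizeof(msg), "thread %d could not bind to CPU id %d",
                   tid, cpu_of[tid]);
          #pragma omp critical                   /* keep error messages from interleaving */
          perror(msg);
        }
      }
    }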

Marking Guidelines

Your code must preserve correctness to obtain any marks for the associated task. Code tidiness, clarity, elegance and parallelization will be expected. Most marks will be allocated directly to the report. Quality of writing (grammar, sentence structure, organization) and clarity of explanations will be expected. Presentation of data is important. Don't include data/program output that is not relevant to the question being answered; don't present numbers with a meaningless number of digits.

In terms of content, the bottom line is to show good and interesting insights into performance issues for a numerically-intensive computation on this multicore, multithreaded processor. Little or no marks will be given for pasting voluminous and uninterpreted program output.

In your report, you need not go to a lot of trouble with typesetting tables and figures. In fact, plain text format will be quite adequate. However, your report should be tidy and presentable (including free from spelling and grammatical errors).

Hints

Debugging such numerical computations is hard! However, since you start with correct code, test your program after every change; if the residual check fails, you will know that it is due to what you last did! For small N, printing out the matrix at the various stages of the computation, and comparing this to the same stage in a correct version, may be helpful.

Last modified: 31/08/2011, 11:55
