COMP8320 Multicore Computing
Assignment 1, 2011
OpenMP LINPACK Benchmark on the UltraSPARC T2
Deadline: 17:00 on Tuesday 06 September (amended from Friday 02 September)
(Please report any errors, omissions and ambiguities
to the course lecturer)
(amendments to this document after its release will be marked in
blue).
This assignment is worth 20% (amended from 15%) of your total course mark.
It will be marked out of 60.
Submission
This assignment must be submitted electronically. This can be done
using the
submit comp8320 ass1 ... command (from the student
system, not
wallaman).
The file names which you should use (and the
only ones that will be accepted by the submit command) are:
report.pdf
linsolve.c
matmult.c
placethreads.c
and optionally:
ass1.tar
Extensions
See
http://cs.anu.edu.au/student/comp8320/assessment.php.
Late penalties are as follows:
| How Late | Penalty from 60 Marks |
|---|---|
| less than 1 hour | -1 |
| between 1 and 6 hours | -2 |
| between 6 and 24 hours (1 day) | -4 |
| between 24 and 48 hours (2 days) | -8 |
| between 48 and 72 hours (3 days) | -16 |
| between 72 and 96 hours (4 days) | -32 |
| more than 96 hours (4 days) | -64 (forget it!) |
Note: late penalties apply even if you submit just one file after the
deadline.
Plagiarism
This is an individual assignment. Any significant assistance received
from anyone except the course staff must be acknowledged at the top of
report.pdf. See
http://cs.anu.edu.au/student/comp8320/assessment.php.
Background
The
LINPACK Benchmark
is arguably (contentiously!) the most important benchmark in
High Performance Computing. It solves an N x N linear system
A x = y for x, first factoring A using blocked LU
decomposition as A = P L U, where P is a row permutation matrix,
L is a lower triangular matrix with a unit diagonal, and U is an
upper triangular matrix. It then finds the solution as x = A' y,
where A' = U' L' P' (the notation A' denotes the inverse of the
matrix A). Usually, the factorization is performed in-place,
i.e. the storage for A is overwritten, with the elements of
L replacing the lower triangular half of A (the diagonal
elements of L are 1.0 and are not explicitly stored) and the
elements of U replacing the upper half.
An optimization commonly performed is to augment the matrix with the
input vector to form (A|y), and to factorize this N x (N+1)
matrix, leaving (L\U|z) in its place (that is, the lower half
of A has been overwritten with L, the upper half has
been overwritten by U, and y has been overwritten by
z, where z = L' P' y). The solution stage then simply
involves x = U' z.
An unblocked version of LU decomposition (using Gaussian elimination)
is given below, in pseudo-C notation for an M x N
matrix A:
for (i=0; i < N; i++) {
  p = i + index_largest_element(A(i:M-1, i));
  swap A(i, :) and A(p, :);
  A(i+1:M-1, i) /= A(i,i);  /* divide all elements in the vector on the LHS by A(i,i) */
  A(i+1:M-1, i+1:N-1) -= A(i+1:M-1, i) * A(i, i+1:N-1);
  /* subtract from the sub-matrix on the LHS the vector outer-product on the RHS */
}
Here we use the `array section notation':
A(i1:i2, j1:j2) denotes the (i2-i1+1) x (j2-j1+1) sub-matrix of
A, starting from position (i1, j1); i1 or j1 can be omitted if it
is the lower bound (0), and i2 or j2 can be omitted if it is the
upper bound (M-1 or N-1, respectively), so that A(i, :) denotes
all of row i.
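For concreteness, the same algorithm might be written in plain C roughly as
follows. This is an illustrative sketch only (not the code in the provided
files), assuming the column-major storage with leading dimension ldA
used by the provided code (described under Setup below), and recording
0-based pivot indices:

#include <math.h>

/* Sketch: unblocked LU decomposition with partial pivoting of an M x N
   column-major matrix A with leading dimension ldA.  piv[i] records the
   (0-based) row interchanged with row i; the provided code instead follows
   the Fortran 1-based convention.  No singularity check is made. */
void lu_unblocked(int M, int N, double *A, int ldA, int *piv)
{
    int i, j, k;
    for (i = 0; i < N && i < M; i++) {
        /* find the row p with the largest |A(k,i)| for k >= i */
        int p = i;
        for (k = i + 1; k < M; k++)
            if (fabs(A[k + i*ldA]) > fabs(A[p + i*ldA]))
                p = k;
        piv[i] = p;
        /* interchange rows i and p across all columns */
        for (j = 0; j < N; j++) {
            double t = A[i + j*ldA];
            A[i + j*ldA] = A[p + j*ldA];
            A[p + j*ldA] = t;
        }
        /* scale the column below the diagonal: A(i+1:M-1, i) /= A(i,i) */
        for (k = i + 1; k < M; k++)
            A[k + i*ldA] /= A[i + i*ldA];
        /* rank-1 update of the trailing sub-matrix */
        for (j = i + 1; j < N; j++)
            for (k = i + 1; k < M; k++)
                A[k + j*ldA] -= A[k + i*ldA] * A[i + j*ldA];
    }
}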
We can give a similar pseudo-code sketch for the blocked algorithm
(see below). For each block column of width w, the first step is
effectively an unblocked LU decomposition of the rectangular lower panel
A(j:N-1, j:j+w-1); the w x w triangular block of this panel and the
part below it together form one block column of the lower factor, while
the upper triangular factor is built up from the row panels computed at
each step.
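A sketch of the blocked algorithm in the same pseudo-C notation might look
like the following, for a square N x N matrix and assuming for simplicity
that the block width w divides N; unblocked_LU() is an illustrative name
for the unblocked factorization above, and T denotes the w x w unit lower
triangular block of the current panel:

for (j = 0; j < N; j += w) {
  /* factor the lower panel with the unblocked algorithm, applying the
     row interchanges it chooses across the whole matrix */
  unblocked_LU(A(j:N-1, j:j+w-1));
  /* form the row panel of U with a triangular solve (cf. dtrsm) */
  A(j:j+w-1, j+w:N-1) = T' * A(j:j+w-1, j+w:N-1);
  /* update the trailing sub-matrix with a matrix-matrix product (cf. dgemm) */
  A(j+w:N-1, j+w:N-1) -= A(j+w:N-1, j:j+w-1) * A(j:j+w-1, j+w:N-1);
}

Here T' again denotes the inverse of T.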
The blocked algorithm permits the computation to be expressed in terms
of (sub-) matrix-matrix operations, which in turn greatly reduces the
number of load and store instructions and improves cache performance.
The block size is commonly called NB. A value of NB =
N will force the blocked algorithm to behave as an unblocked
algorithm; a value of NB = 1 has a similar effect. The optimal
value of NB will be a moderate-sized number, dependent on the
memory system characteristics of the machine it will run on; 16 ≤
NB ≤ 80 are typical values.
To support the development of such algorithms, the BLAS
(Basic Linear Algebra Subprograms) library was introduced. This contains
generalized vector-vector, vector-matrix and matrix-matrix operations,
with highly optimized versions being available on most machines. The
generalized matrix-matrix multiplication routine dgemm()
dominates BLAS-based blocked LU decomposition performance for large
N.
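As an illustration (not part of the provided code), the trailing sub-matrix
update of the blocked algorithm could be expressed as a single dgemm
call, roughly as follows. The Fortran-style prototype with a trailing
underscore is an assumption about the binding; with the Sun Performance
Library the C interfaces in <sunperf.h> would normally be used instead.

/* Sketch: C(0:m-1, 0:n-1) -= A(0:m-1, 0:k-1) * B(0:k-1, 0:n-1) on
   column-major sub-matrices sharing the leading dimension ldA. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

void trailing_update(int m, int n, int k,
                     const double *A, const double *B, double *C, int ldA)
{
    const double minus_one = -1.0, one = 1.0;
    dgemm_("N", "N", &m, &n, &k, &minus_one, A, &ldA, B, &ldA, &one, C, &ldA);
}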
Setup
Copy the files (from
wallaman) for the assignment
into your home directory area:
cp -r /dept/dcs/comp8320/public/ass1/ .
The provided program linpack.c, when compiled (via the command
make) and executed as linpack N, produces a
randomly generated N x N matrix A with values in the
range -1 to +1. It initializes the solution vector x0
similarly, then computes the corresponding input vector using y =
Ax0. It then runs the benchmark by calling
dgetrf() to factorize (A|y), overwriting it with
(L\U|z), and then finds the resulting solution using
dtrsv() which computes x = U' z. The permutation
matrix P is represented by a pivot vector P[], where
P[i] records the index of the row which was interchanged with
row i during the factorization.
x is compared with x0 via a normalized
residual calculation; a value less than 1 means that the result
is both correct and accurate, a value slightly greater than 1 means the
result is inaccurate, and a much larger value indicates an error
(i.e. a bug in the Linpack benchmark code used). The benchmark is deemed
to have failed if the residual exceeds 1.
As the program calls standard routines such as dgetrf(), which
follow Fortran conventions for arrays (column-major storage and indices
beginning from 1), the program creates A in column-major
storage. That is, logical elements A(i,j) and
A(i+1,j) are in consecutive double words, whereas
A(i,j) and A(i,j+1) are separated by ldA ≥
N double words. The code is set up so that logical element
A(i,j) may be accessed as either A[j][i] or
*(A[0] + i + j*ldA). For the same reason, returned quantities
referring to array indices are adjusted as if array indices started from
1 (hence a value p+1 in P[i] signifies that rows i and
p were interchanged).
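For example, applying the recorded interchanges to a vector (the
permutation part of the solve stage) might look like the following sketch,
which assumes the 1-based pivot convention just described; this is
illustrative only, as the provided code relies on the library routines for
this step.

/* Apply the row interchanges recorded in P[] to the vector y, in the
   order in which they were generated during the factorization. */
void apply_pivots(int N, const int *P, double *y)
{
    int i;
    for (i = 0; i < N; i++) {
        int p = P[i] - 1;        /* P[] is 1-based (Fortran convention) */
        if (p != i) {
            double t = y[i];
            y[i] = y[p];
            y[p] = t;
        }
    }
}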
The provided Makefile compiles the codes with
/opt/SUNWspro/prod/bin/cc rather than the
OpenMP 3.0 compliant compiler in /opt/sunstudio12.1/bin/.
This is because the former gives better performance for this computation.
The general usage for the program is:
./linpack [-p] [-b NB] [-v v] [-c c] [-a info] N [r [dN]]
(just executing the command
./linpack will give a full
description of these parameters). The
-p option can be used to print
the matrices and vectors; this can be useful for debugging.
The
-a info option passes auxiliary information
via the string
info down to the matrix factorization
routines. The
-c c option can be used for placing threads on
cores (see below).
The -v v option is used to
select various versions of the benchmark. v=0 will perform the
computation using the optimized dgetrf() of the Sun Performance
Library. The other versions can be tuned via the blocking factor
NB. v=1 will select dgetrfBL() in
linaux.h; this uses the (optimized) BLAS routines in a fashion
corresponding to the blocked algorithm mentioned
above. Both of these versions will parallelize if
$OMP_NUM_THREADS is set to a value greater than 1.
v=2 will select the function dgetrfC(); this is a version
of dgetrfBL() with BLAS calls translated into C loops.
v=3 will select the function dgetrfCN(); this is (initially)
similar to dgetrfC(), except that you will later parallelize
this using nested parallelism.
v=4 will select the function dgetrfBLP()
and v=5 will select the function dgetrfWild().
These are all in linsolve.c; it is up to you to work on these!
dgetrfC() calls a matrix multiply routine matMult();
this is kept in a separate file matmult.c so it can be analyzed
(and if need be compiled) separately.
Finally, note that the UltraSPARC T2 was not designed for floating-point
intensive computations, and it may prove difficult to approach the goal
of peak performance (6 cores * 1.2 GHz = 7.2 GFLOPS in the case of
wallaman). However, the assignment should give you experience
in designing, coding and analyzing OpenMP applications on a highly
multicore multithreaded processor.
Your Tasks
In all of these tasks, you are expected to record your findings
and comment on any conclusions in your report.
It is recommended that for each series of measurements, you cut-and-paste
a representative example of the command that you used, e.g.
export OMP_NUM_THREADS=16; ./linpack -v 1 -b 8 4000
for the 2nd series of measurements in Q1.
- Find the optimum values for NB for dgetrfBL() at
N=1000 (1 thread) and N=4000 (16 threads).
It is suggested that you try multiples of 8, until performance clearly
begins to decrease. Tabulate your results. In general, what would you
expect to be an optimal value?
Hint: assume that the SunStudio BLAS operates by dividing
the matrix multiply up so that input matrices are in square chunks.
- Compare the speed and scalability of dgetrf() and
dgetrfBL() for both (i) 1, 2, 4, 8 threads at N=1000
and (ii) 8, 16, 24, 32, 48 threads at N=4000.
For the latter, use the optimal NB from the previous question. Again,
tabulate your results, including the speedup ratios (based on the
speed at the smallest number of threads).
- Before performing any modifications, determine the performance of
dgetrfC() (with only 1 thread) and find the optimal blocking
factor. You will probably notice that it is significantly slower.
Execute the command make matmult.s and inspect the file. Locate
the innermost loop (you will notice that the compiler has unrolled this
loop). Insert this code into your report, and perform a `cycle count' of
the loop (in tutorials, you will be given some instructions on how to do
cycle counts; for references on the SPARC instruction set, see COMP6300
lecture M4
and Laboratory
09). Does this explain the single-thread performance? Briefly explain.
Furthermore, does the degree of unrolling have any effect on the choice
of optimal NB?
- Parallelize the function dgetrfC() (including
matMult()) using OpenMP directives, with only a single level
of parallelism. Experiment with different possibilities (including
parallelizing inner vs outer loops). Consider loop re-ordering where
possible. Try to obtain the best parallelization strategies, and
describe and justify them (e.g. why they are valid and why they should
be efficient on the T2) in your report. Also answer the following
question: is there any scope for the OpenMP sections directive; if
so, would it be useful to try to use it?
Measure the speed and scalability of dgetrfC() for both
(i) 1, 2, 4, 8 threads for N=1000 and (ii) 8, 16, 24, 32, 48
threads for N=4000. Compare your results with
dgetrfBL(). What conclusions can you draw?
- Parallelize the function dgetrfCN() using two levels of
nested parallelism in at least one parallel region (other parallel
regions should use the same single-level parallelism strategy that you
used for the previous question). In the nested region(s), the total number
of threads should be $OMP_NUM_THREADS, and, provided
$OMP_NUM_THREADS is not prime, the number of threads at each
level should always exceed 1. Again, try to obtain the best strategy
for nested parallelism, and describe and justify it in your report.
To test that the nesting is working as expected, when the -a
p option is used, a single thread in each nested team should
print its team id (the thread id of its `ancestor') and the number of
threads in its team. This must occur only the first time the
(first) nested region is entered (for full marks, you should ensure
the messages don't get repeated if dgetrfCN() gets called a
second time), e.g.
u8914893@wallaman:~> export OMP_NUM_THREADS=4; ./linpack -v 3 -a p 1000
u8914893@wallaman:~> export OMP_NUM_THREADS=16; ./linpack -v 3 -a p 1000
compute the Linpack benchmark, for N=1000 with NB=32 (alg version 3) using 16 threads
Nested team 2: 4 threads
Nested team 1: 4 threads
Nested team 0: 4 threads
Nested team 3: 4 threads
Linpack benchmark with N=1000 in 2.09689s @ 318 MFLOPS
PASSed residual check: 6.39e-03
u8914893@wallaman:~>
Take some appropriate measurements and compare the performance
with that of dgetrfC(), and discuss in your report.
- Implement OMP thread placement code in
placethreads.c. In a parallel region, get each thread to bind
itself to the required virtual CPU id (see placethreads.h)
using the processor_bind() system call (you will also
probably need each OpenMP thread to obtain and pass its Solaris thread id,
which is slightly different from the OpenMP thread id, using the
thr_self() call). Consult the Solaris man
pages for both of these; also see the thread placement code in
test_barriers.c from Lab 3. If the call fails, the thread
should call perror() with a message stating which thread
could not bind to which CPU id; it should ensure this message appears
cleanly when 2 or more threads encounter errors. It should then call
processor_bind() to confirm the bindings and list them in
order, as in:
u8914893@wallaman:~/comp8320/ass1> ./linpack -c 2 1000
compute the Linpack benchmark, for N=1000 with NB=32 (alg version 0) using 16 threads
OMP threads 0 to 15 are bound to cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Linpack benchmark with ...
u8914893@wallaman:~/comp8320/ass1> ./linpack -c 8 1000
compute the Linpack benchmark, for N=1000 with NB=32 (alg version 0) using 16 threads
thread 15 could not bind to CPU id 60: Invalid argument
thread 14 could not bind to CPU id 56: Invalid argument
OMP threads 0 to 15 are bound to cpus: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 -1 -1
Linpack benchmark with ...
Note that virtual CPU ids 0 to 3 are in the first group in core 0, and
ids 4 to 7 are in the second group. With both 8 and 16 threads, try
using thread placement for 1, 2, 4, and 6 cores. Explain the
results.
- In dgetrfBLP(), begin with the code of
dgetrfBL(), but explicitly parallelize the (important) BLAS
calls, using a parallel region; each thread executes one `segment' (or
possibly several) of the original BLAS operation. In the case of
dgemm(), consider making segments in either or both
dimensions of the 3rd matrix operand. As above, explain and justify
your strategy, and compare the results with dgetrfBL(). For
details on the BLAS API, see man dgemm etc.
- You will probably notice a difference in speeds in the various
versions. Use the Solaris cputrack and collect tools
to analyse and explain the performance differences between benchmark
versions 0 to 3. For example, is one version slower because it has
more TLB or level 2 cache misses, or does it simply execute more
instructions, or does it have more cycles when all threads are stalled?
Hint: use cputrack -T 100 -c ....
to get the overall counts (as profiling-based methods aren't as
accurate when the counts become small). Then, when a significant
difference is found, use collect to see in what functions the
events are primarily occurring. Tip: in the
Functions tab, right clicking on a column header,
e.g. Inclusive OMP Wait Time, will sort the functions by that
count.
- (optional, for bonus marks)
The function dgetrfWild() may be coded if you
have ideas on improving Linpack Benchmark performance in ways not
covered so far. As above, explain and justify your ideas, and compare
the results with the appropriate version of the Linpack Benchmark so
far.
If necessary, add any extra files in ass1.tar.
- (optional, for bonus marks) If you can discover any more
interesting experiments in the Linpack Benchmark on the UltraSPARC T2
(e.g. exploring OpenMP 3.0 features), or on another highly multicore
platform at your disposal, please describe them in your report for
bonus marks. If necessary, add any extra files in
ass1.tar. This should contain a README file
describing the other contents, and if needed, how to compile and run
these.
- Your report should also give overall conclusions on your work, e.g.
commenting on the overall effectiveness of the compilers, hardware
threading and multicore architecture. Questions such as `what are the
secrets of dgetrf?' could be addressed here.
Your report may be generated in your favorite format (which may even be plain
text!) but must be converted to standard PDF format before submission.
The C files should contain your final (best) versions of your code,
although interesting intermediate versions of code sections could be
left there commented out (preferably with
#ifdefs) and/or
placed in your report.
Marking Guidelines
Your code must preserve correctness to obtain any marks for the
associated task. Code tidiness, clarity, elegance and parallelization
will be expected. Most marks will be allocated directly to the report.
Quality of writing (grammar, sentence structure, organization) and
clarity of explanations will be expected. Presentation of data is
important. Don't include data/program output that is not relevant to
the question being answered; don't present numbers with a meaningless
number of digits.
In terms of content, the bottom line is to show good and interesting
insights into performance issues for a numerically-intensive
computation on this multicore, multithreaded processor. Few or no
marks will be given for pasting voluminous and uninterpreted program
output.
In your report, you need not go to a lot of trouble with typesetting
tables and figures. In fact, plain text format will be quite
adequate. However, your report should be tidy and presentable
(including free from spelling and grammatical errors).
Hints
Debugging such numerical computations is hard! However, as you can
start with correct code, test your program after every change; if the
residual check fails, you will know that it is due to what you last did!
For small
N, printing out the matrix at the various stages of
the computation, and comparing this to the same stage in a correct version,
may be helpful.
Last modified: 31/08/2011, 11:55