COMP3320/6464 2006: Laboratory 3
Loops and Memory Access Patterns
The aim of this lab is to investigate various aspects of loops and
their performance
YOU MUST LOGON TO IWAKI TO DO THIS LAB
Start by creating a directory and copying all the relevant programs
into that directory:
mkdir lab3
cd lab3
cp -r /dept/dcs/comp3320/public/lab3/* .
Lab Assessment
You are not required to submit labs, but all lab material is
examinable! You are therefore advised to record you answers to each
question.
If you copied the files as above you will have a template answer file
called:
COMP3320_LAB3.txt
Where relevant write answers in plain text in this file. At the end
of the lab you should have created the following files
- COMP3320_LAB3.txt: text answers to questions below
- matvec0.c: matrix vector product included unrolled dot
product algorithms
- matvec1.c: COMP6464 only - matrix vector product
included unrolled daxpy based algorithms
- matvec2.c: benchmark code including modifications to
test unrolled dot product and (comp6464) daxpy algorithms
- myvec.c: unrolled dot product and (comp6464) daxpy
algorithms
Assembly, Aliasing and Dependency
Objective - to clearly see compiler generated loop unrolling in the
assembly code and to understand what is meant by aliasing and its
implications
File
vadd.c contains the following code:
void vadd(int n, double a[], double b[]){
int k;
for (k = 0; k < n; k++){
a[k] = a[k] + b[k];
}
}
that sums together two vectors. You should now appreciate that on
the SPARC architecture and under the right conditions this loop can
achieve a peak performance of 1 loop iteration every 3 clock cycles.
(If you are unclear about this talk to your tutor,i.e. me!)
We are going to have a look at the SPARC assembly. We generate this
using the -S compiler flag. Use the following command:
cc -fast -xarch=v8plusa -xprefetch=no -S vadd.c
or
make vadd.s
The assembly is placed in a file with a
.s postfix, ie
vadd.s.
Note the
-xprefetch=no command tells the compiler not to generate
prefetch instructions. Prefetches are like early loads that aim to
bring data from memory to cache in advance of needing the data. They
take up a load/store instruction slot - and in this case complicate
the assembly. (At a later stage you can try removing this line and
comparing the assembly.)
Inspect the assembly. The main part of the for loop should be
indicated by references to the label L900000107. You don't need to
know a huge amount of SPARC assembly, but here are some basic commands:
ldd - load double (8 bytes)
std - store double (8 bytes)
add - integer add
cmp - integer comparison
fadd - floating point add
ble,pt - branch if less or equal to zero to specified label. Note the
,pt means branch is predicted to be taken, pn is predicted
not taken
- From the assembly cut out all those lines that form the main body
of the loop. This is all assembly instructions that are repeated as
the L900000107 loop is repeated. (Remember SPARC assembly has a
branch delay slot)
- The compiler has automatically unrolled the loop. How many times
has it been unrolled. Justify your answer.
- Using the grouping rules for the SPARC architecture, and assuming
that data is loaded from level 1 cache (2 cycle latency), how
many cycles will it take for 1 iteration of this unrolled loop to
complete. There is one piece of
extra information you require. Store operations on the UltraSPARC
can actually be issued immediately after the instruction that
creates the data to be stored. This is because the store operation
goes to a store queue that waits for the data to arrive in the
correct register. Thus in one cycle we can do fadd %f1,%f2,%f3 and
then in the next cycle we can immediately issue a std %f3,[int_address].
We are now going to run
vadd.c. The calling program is
runadd.c. Type
make runadd1
This will produce the
runadd1 executable that can be run by typing
runadd1 number
where number is an integer less than 500. The program allocates an
array (
a), initializes the elements to 1.0 and then calls
vadd. The
values of the array are printed before and after calling
vadd.
- Look at the code. Write down an expression in terms of a[i],
a[i+n], a[i-n] etc to
indicated what the program is doing. (Convince yourself that the
program is working correctly, ie. for small n work out
results on paper by hand and make sure you agree with the program).
Now type
make runadd2
this will use exactly the same source code to produce the
runadd2 executable.
- Run runadd2 with input of 5, 10, 15, 20, and 25. You should find the
"final a" output is correct for input values of 5, 10, and 15 but wrong for
input values of 20 and 25. (Note: if this were a real life application
it would appear to work fine for small cases, but give big errors for
large! Highlighting the need to do extensive testing on real codes!)
However the results for runadd1 are
always correct. Give a detailed explanation as to what is happening and
why runadd2 gives errors (but only in some cases).
As an aside - type the following
make vadd-info
this uses the
-Wc,-Qms_pipe+info compiler option to give you
some useful information about the loops in the
vadd.c file
and whether or not they were pipelined.
Loop Unrolling: Matrix Vector Product
Objective - using a matrix vector product as an example learn how to
unroll outer loops and then to measure the
performance change
Programs
matvecX.c compute a matrix vector product for a
square matrix (
a) of dimension
n with a vector
b, summing the result into another vector
x. The
programs have two options depending on whether we want to use matrix
a or the transpose of
a.
Dot Product Formulation
Programs
matvec0.c uses a dot product based formulation with
the two options equating to the following:
NORMAL
for (i=0; i < n; i++){
sum=0.0
for (j=0; j < n; j++){
sum=sum+a[i][j]*b[j];
}
x[i]=x[i]+sum;
}
TRANSPOSED
for (i=0; i < n; i++){
sum=0.0
for (j=0; j < n; j++){
sum=sum+a[j][i]*b[j];
}
x[i]=x[i]+sum;
}
Compile the program by typing
make matvec0
The program takes as input the integer dimension of the matrix, eg
matvec0 3
Inspect the program, compile it and run it for a small integer input
value, eg 3. The program will print out the matrices and results, make
sure that the program is working as you expect, ie. multiply the
matrix and vector by hand on a piece of paper and verify the program
is indeed giving the correct results. The program should be correct
(ie. this is NOT a trick or test!),
the aim here is to confirm that you understand how to perform a matrix
vector product. If you have any problems ask your tutor.
The program currently stops after experiment 2 is complete. When you
are happy that you understand the program change the #define value of
STOP to be 0. The program will now attempt to run experiments
3 and 4.
In experiments 3 and 4 we want to use an unrolled matrix vector
product, in which the outer loop (loop i in the above code segment)
has been unrolled by a factor of 2. Code for experiment 3 is half
complete, while all the code for experiment 4 needs to be written
Experiment 3 uses the code segment:
printf("\n ------unrolled dot normal------\n");
for (i=0; i < n/2; i++){
mydot(&x[2*i],n,&a[2*i*n],1,b,1);
mydot(&x[2*i+1],n,&a[(2*i+1)*n],1,b,1);
// mydot_u2();
}
if (n%2 == 1)mydot(&x[n-1],n,&a[(n-1)*n],1,b,1);
}
This code gives the correct answers, but to complete the unrolling
process we must replace the first two
mydot calls by a single
new routine - labelled
mydot_u2. (A commented out call to
mydot_u2 is shown in the code above.
- Write the code for mydot_u2. Verify that it works
correctly. If you have any problems ask your tutor.
- Write the code to make experiment 4 work correctly. You should use
the same mydot_u2 code, but this time you need to write
the outer loops that call this routine, ie. the equivalent of the
small code section immediately above. Again verify that it works
correctly and ask your tutor if you have any problems. It is
important you get this correct.
Daxpy Formulation
Program
matvec1.c is analogous to
matvec0.c except
that the matrix
vector product is now based on a daxpy formulation, ie.
NORMAL
for (i=0; i < n; i++){
for (j=0; j < n; j++){
x[j]=x[j]+a[j][i]*b[i];
}
}
TRANSPOSED
for (i=0; i < n; i++){
for (j=0; j < n; j++){
x[j]=x[j]+a[i][j]*b[i];
}
}
As before the code works using normal daxpy operations. Verify that
the code works correctly and you understand what it is doing
COMP6464 Only
- Write code to enable experiment 7 to work in an analogous fashion
to experiment 5, but with two way outer loop unrolling.
- Write code to enable experiment 8 to work in an analogous fashion
to experiment 6, but with two way outer loop unrolling.
Performance Evaluation (COMP3320 and COMP6464)
In program
matvec2.c we will benchmark the performance of the dot
product and daxpy based matrix vector algorithms demonstrated in
matvec0.c and
matvec1.c. This time we will use much
larger matrices and no longer print them out. Compile the code by typing
make matvec2
As before, to run the code, give on the command line the dimension of
the matrix to be used, e.g.
matvec2 1000
The program loops over normal and transposed versions of the matrices
and also over over all different versions that are implemented. This
is achieved by the statements:
for (trans=0; trans < 2; trans++){
for (vers=0; vers < NVERSIONS; vers++){
The benchmark is run and the elapsed time measured by the lines:
elp_time(&isec0,&iusec0);
for (i=0; i < repeat;i++)matvecmult(vers,trans,nsize,a,b,c);
elp_time(&isec1,&iusec1);
where the value of
repeat is increased until the benchmark takes
longer than 500usec to complete (ie. well above resolution of the the
elapsed time clock). Since the same calculation is repeated several
times with different algorithms (values of
vers) we check that the
results produced by succesive experiments agree (to within some
threshold). If all looks okay we print out details of the benchmark
algorithm (values of
trans and
vers), the dimension
of the problem,
and the time taken for one matrix vector product. We then move on to
the next algorithm.
In addition to the dot product and daxpy schemes investigated above,
we will also use the Sun Optimized Performance Library. The function
we need is dgemv (d- double precision, ge - general, m - matrix, v -
vector) and we access the library via the -xlic_lib=sunperf
compiler flag and the sunperf.h header file. As it stands the
program uses three different values of vers corresponding to
the following:
----------------------------------
Version Comment
----------------------------------
0 - sunperf dgemv
1 - dot product not unrolled
2 - daxpy not unrolled
----------------------------------
- Run the code as is for n=1000 and complete the following table
----------------------------------------------------------------
The machine i'm using is IWAKI
Transpose Version Time(sec) MFLOPS
----------------------------------------------------------------
0 0
0 1
0 2
1 0
1 1
1 2
----------------------------------------------------------------
You should derive the MFLOPS from the recorded times. Explain how you did
this. Provide some detailed comments on the relative performance of
the different algorithms.
Augment the code to include your unrolled dot product and (comp6464
only) daxpy matrix vector products. Treat these as version 3 and 4
respectively.
- Now augment your table with performance data for the two unrolled
routines for n=1000.
----------------------------------------------------------------
The machine i'm using is ??????????
Transpose Version Time(sec) MFLOPS
----------------------------------------------------------------
0 3
0 4 (COMP6464 ONLY)
1 3
1 4 (COMP6464 ONLY)
----------------------------------------------------------------
Comment on the performance of these routines in comparison to the
other versions
- Finally matvec3.c is identical to matvec2.c, except
that the dot product and daxpy routines that were in
myvec.c are now explicitely included in the
matvec3.c file. Compile and run the code. You should now
find that the dot product and daxpy times are similar
(related). Explain what has happened.