COMP3320/6464 2012: Laboratory 3
Loops and Memory Access Patterns
The aim of this lab is to investigate various aspects of loops and
their performance.
YOU MUST LOGON TO xe.nci.org.au TO DO THIS LAB
Start by logging on to xe.nci.org.au and mapping its directory. If you don't know how, take a look at lab2.
Copy lab3.tar.gz onto xe and unpack it.
Set up your environment by typing:
- module load papi
- module load gcc/4.4.4
Lab Assessment
You are not required to submit labs, but all lab material is
examinable! You are therefore advised to record your answers to each
question.
If you copied the files as above you will have a template answer file
called:
COMP3320_LAB3.txt
Where relevant write answers in plain text in this file. At the end
of the lab you should have created the following files
- COMP3320_LAB3.txt: text answers to questions below
- matvec0.c: matrix vector product including unrolled dot product algorithms
- matvec1.c: COMP6464 only - matrix vector product including unrolled daxpy based algorithms
- matvec2.c: benchmark code including modifications to test unrolled dot product and (COMP6464) daxpy algorithms
- myvec.c: unrolled dot product and (COMP6464) daxpy algorithms
Assembly, Aliasing and Dependency
Objective - to clearly see compiler generated loop unrolling in the
assembly code, and to understand what is meant by aliasing and its
implications.
File vadd.c contains the following code:
void vadd(int n, double * __restrict__ a, double * __restrict__ b){
    int k;
    for (k = 0; k < n; k++){
        a[k] = a[k] + b[k];
    }
}
that sums together two vectors. The keyword __restrict__ instructs the
compiler to assume there is no pointer aliasing between these two
parameters. This enables the compiler to generate more optimised code;
however, it can lead to errors when a and b alias (the arrays overlap,
eg. a = b).
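To make the aliasing hazard concrete, here is a small stand-alone sketch (the function name vadd_restrict and the test driver are illustrative only, not part of the lab sources). When the pointers overlap, the restrict promise is broken, so the compiler is free to reorder loads and stores and the output may differ from the element-by-element answer, depending on the optimisation level used:

#include <stdio.h>

/* Same shape as vadd: the compiler may assume a and b never overlap. */
static void vadd_restrict(int n, double * __restrict__ a, double * __restrict__ b){
    int k;
    for (k = 0; k < n; k++){
        a[k] = a[k] + b[k];
    }
}

int main(void){
    double v[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    int k;
    /* a and b overlap here (a = &v[1], b = &v[0]), violating the restrict promise. */
    vadd_restrict(7, &v[1], &v[0]);
    for (k = 0; k < 8; k++) printf("%g ", v[k]);
    printf("\n");
    return 0;
}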
Peak Performance
To execute one iteration of the loop we load two values from memory, add them together and store the result back to memory. Assuming the compiler unrolls the loop and rearranges instructions, this loop can achieve a peak performance of one iteration every 4 clock cycles.
Generating Assembly Code from C
We are going to have a look at the x86 64bit assembly. We generate this
using the -S compiler flag. Use the following command:
gcc -O2 -funroll-loops -S vadd.c
or
make vadd.s
The assembly is placed in a file with a .s postfix, ie vadd.s.
Compiler options
Note the -funroll-loops flag tells the compiler to unroll the inner loop. We did not use -O3 as this enables SIMD (SSE) instructions.
Understanding Assembly
Inspect the assembly. The main part of the for loop should be
indicated by references to the label .L3; above it you can see the loop preamble (set-up code).
You don't need to know a huge amount of x86 assembly. Note that x86 is a CISC architecture, so arithmetic instructions can operate not only on registers but also directly on memory operands. Here are some basic instructions:
movsd - move 8B from memory to a register or from a register to memory
addsd - floating point add
addq - integer add
cmpq - integer comparison
jne - jump if not equal
Registers are divided into groups:
xmm0 - xmm15 - SSE floating point registers (128b wide)
rax - 64b register (64b version of the x86 accumulator)
rdx - 64b register (64b version of the x86 data register)
rsi - 64b register (64b version of the x86 source index register)
Look at an example of one loop:
movsd (%rsi,%rax), %xmm12   // rax is the loop counter (k); rsi contains the address of a
                            // we load the k-th element of a into register xmm12
addsd (%rdx,%rax), %xmm12   // rdx contains the address of b
                            // we add the b[k] element sitting in memory to register xmm12
                            // and store the result in register xmm12
movsd %xmm12, (%rsi,%rax)   // we save the contents of register xmm12 into a[k]
addq $8, %rax               // we increment the loop counter by 8B (the size of a double)
Questions
- From the assembly cut out all those lines that form the main body
of the loop, that is, all the assembly instructions that are repeated
each time the .L3 loop body executes.
- The compiler has automatically unrolled the loop. How many times
has it been unrolled? Justify your answer.
We are now going to run vadd.c. The calling program is runadd.c. Type
make runadd1
This will produce the runadd1 executable, which can be run by typing
runadd1 number
where number is an integer less than 500. The program allocates an
array (a), initialises its elements to 1.0 and then calls vadd. The
values of the array are printed before and after calling vadd.
- Convince yourself that the program is working correctly, ie. for
small n work out results on paper by hand and make sure you agree with
the program.
Now type
make runadd2
This will use exactly the same source code to produce the runadd2
executable.
- Run runadd2 with inputs of 5, 10, 15, 20 and 25.
You should find the "final a" output is correct for an input of 5 (and
in fact for inputs up to 8), but wrong for inputs of 10 onwards.
(Note: if this were a real-life application it would appear to work
fine for small cases but give big errors for large ones, highlighting
the need to do extensive testing on real codes!) However, the results
for runadd1 are always correct.
Give a detailed explanation of what is happening and why runadd2
gives errors (but only in some cases).
Now we will investigate the performance of both variants. Type
make runadd1_perf
make runadd2_perf
./runadd1_perf 1000 1000000
./runadd2_perf 1000 1000000
This will run the performance benchmark with an array size of 1000;
the benchmark is repeated 1000000 times.
- You should clearly see the performance difference. Write down the average MFLOPS of both versions and
explain how this is possible.
Loop Unrolling: Matrix Vector Product
Objective - using a matrix vector product as an example, learn how to
unroll outer loops and then measure the performance change.
The programs matvecX.c compute a matrix vector product, Ab = x, for a
square matrix (A) of dimension n and a vector b, summing the result
into another vector x. The programs have two options depending on
whether we want to use matrix A or the transpose of A.
Dot Product Formulation in matvec0.c
Program matvec0.c uses a dot product based formulation, with the two
options equating to the following:
NORMAL Ab = x
for (i=0; i < n; i++){
    sum = 0.0;
    for (j=0; j < n; j++){
        sum = sum + a[i][j]*b[j];
    }
    x[i] = x[i] + sum;
}
TRANSPOSED ATb = x
for (i=0; i < n; i++){
    sum = 0.0;
    for (j=0; j < n; j++){
        sum = sum + a[j][i]*b[j];
    }
    x[i] = x[i] + sum;
}
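For reference, a minimal self-contained sketch of this dot product formulation as a single C routine is shown below. The function name matvec_dot and the flat row-major storage of A are assumptions for illustration only; matvec0.c organises its data and its normal/transposed options in its own way.

/* Hypothetical sketch: x += A*b (trans == 0) or x += A^T*b (trans == 1),
   with A stored row-major in a flat array of length n*n. */
void matvec_dot(int trans, int n, const double *a, const double *b, double *x){
    int i, j;
    for (i = 0; i < n; i++){
        double sum = 0.0;
        for (j = 0; j < n; j++){
            /* a[i][j] is a[i*n + j]; the transposed variant reads a[j][i] instead */
            sum += (trans ? a[j*n + i] : a[i*n + j]) * b[j];
        }
        x[i] += sum;   /* sum the result into x, as in matvec0.c */
    }
}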
Compilation
Compile the program by typing
make matvec0
The program takes as input the integer dimension of the matrix, eg
matvec0 3
Understanding the Program
Inspect the program, compile it and run it for a small integer input
value, eg 3. The program will print out the matrices and results, make
sure that the program is working as you expect, ie. multiply the
matrix and vector by hand on a piece of paper and verify the program
is indeed giving the correct results. The program should be correct
(ie. this is NOT a trick or test!),
the aim here is to confirm that you understand how to perform a matrix
vector product. If you have any problems ask your tutor.
Unrolling the Matrix Vector Product
The program currently stops after experiment 2 is complete. When you
are happy that you understand the program, change the #define value of
STOP to 0. The program will then attempt to run experiments 3 and 4.
In experiments 3 and 4 we want to use an unrolled matrix vector
product, in which the outer loop (loop i in the above code segment)
has been unrolled by a factor of 2. The code for experiment 3 is half
complete, while all the code for experiment 4 needs to be written.
Experiment 3 uses the code segment:
printf("\n ------unrolled dot normal------\n");
for (i=0; i < n/2; i++){
    mydot(&x[2*i],n,&a[2*i*n],1,b,1);
    mydot(&x[2*i+1],n,&a[(2*i+1)*n],1,b,1);
    // mydot_u2();
}
if (n%2 == 1) mydot(&x[n-1],n,&a[(n-1)*n],1,b,1);
This code gives the correct answers, but to complete the unrolling
process we must combine the first two mydot calls into one new
routine, labelled mydot_u2 (a commented-out call to mydot_u2 is shown
in the code above).
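As a rough illustration of the pattern only (not the required solution: the argument list below is an assumption, and your mydot_u2 must match the mydot interface actually used in myvec.c), fusing the two calls into one loop might look like this:

/* Hypothetical signature. Computes two dot products against the same
   vector y in a single loop:
     *res0 += sum_k x0[k*incx] * y[k*incy]
     *res1 += sum_k x1[k*incx] * y[k*incy]                              */
void mydot_u2(double *res0, double *res1, int n,
              const double *x0, const double *x1, int incx,
              const double *y, int incy){
    double sum0 = 0.0, sum1 = 0.0;
    int k;
    for (k = 0; k < n; k++){
        sum0 += x0[k*incx] * y[k*incy];   /* row 2*i   dotted with b */
        sum1 += x1[k*incx] * y[k*incy];   /* row 2*i+1 dotted with b */
    }
    *res0 += sum0;
    *res1 += sum1;
}

With this (assumed) interface the fused call would replace the two mydot calls, e.g. mydot_u2(&x[2*i], &x[2*i+1], n, &a[2*i*n], &a[(2*i+1)*n], 1, b, 1); the point of the fusion is that each element of b is loaded once and reused for both rows.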
Questions
- Write the code for mydot_u2. Verify that it works
correctly. If you have any problems ask your tutor.
- Write the code to make experiment 4 work correctly. You should use
the same mydot_u2 code, but this time you need to write
the outer loops that call this routine, ie. the equivalent of the
small code section immediately above.
Again verify that it works
correctly and ask your tutor if you have any problems. It is
important you get this correct.
Daxpy Formulation in matvec1.c (Constant times a vector plus a vector)
Program matvec1.c is analogous to matvec0.c except that the matrix
vector product is now based on a daxpy formulation, ie.
NORMAL Ab = x
for (i=0; i < n; i++){
    for (j=0; j < n; j++){
        x[j] = x[j] + a[j][i]*b[i];
    }
}
TRANSPOSED ATb = x
for (i=0; i < n; i++){
    for (j=0; j < n; j++){
        x[j] = x[j] + a[i][j]*b[i];
    }
}
As before, the code works using normal daxpy operations. Verify that
the code works correctly and that you understand what it is doing.
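As with the dot product case, here is a minimal sketch of the daxpy formulation as one routine (the name matvec_daxpy and the flat row-major layout are again illustrative assumptions, not the interface used in matvec1.c):

/* Hypothetical sketch of the daxpy formulation: x += A*b (trans == 0) or
   x += A^T*b (trans == 1), with A stored row-major in a flat n*n array.
   Each outer iteration adds b[i] times one column (or row) of A into x. */
void matvec_daxpy(int trans, int n, const double *a, const double *b, double *x){
    int i, j;
    for (i = 0; i < n; i++){
        for (j = 0; j < n; j++){
            /* normal: a[j][i] = a[j*n + i]; transposed: a[i][j] = a[i*n + j] */
            x[j] += (trans ? a[i*n + j] : a[j*n + i]) * b[i];
        }
    }
}

Notice that the two variants step through the matrix with different strides in the inner loop.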
COMP6464 Only
- Write code to enable experiment 7 to work in an analogous fashion
to experiment 5, but with two way outer loop unrolling.
- Write code to enable experiment 8 to work in an analogous fashion
to experiment 6, but with two way outer loop unrolling.
Performance Evaluation (COMP3320 and COMP6464)
In program matvec2.c we will benchmark the performance of the dot
product and daxpy based matrix vector algorithms demonstrated in
matvec0.c and matvec1.c. This time we will use much larger matrices
and no longer print them out.
Compilation
Compile the code by typing
make matvec2
As before, to run the code, give on the command line the dimension of
the matrix to be used, e.g.
matvec2 1000
Implementation Details
The program loops over normal and transposed versions of the matrices
and also over all the different versions that are implemented. This
is achieved by the statements:
for (trans=0; trans < 2; trans++){
for (vers=0; vers < NVERSIONS; vers++){
The benchmark is run and the elapsed time measured by the lines:
elp_time(&isec0,&iusec0);
for (i=0; i < repeat;i++)matvecmult(vers,trans,nsize,a,b,c);
elp_time(&isec1,&iusec1);
where the value of repeat is increased until the benchmark takes
longer than 500usec to complete (ie. well above the resolution of the
elapsed time clock). Since the same calculation is repeated several
times with different algorithms (values of vers), we check that the
results produced by successive experiments agree (to within some
threshold). If all looks okay we print out details of the benchmark
algorithm (the values of trans and vers), the dimension of the
problem, and the time taken for one matrix vector product. We then
move on to the next algorithm.
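A minimal sketch of how the recorded times can be turned into a per-product time and MFLOPS is shown below. The helper name and the timer argument types are assumptions based on the snippet above (matvec2.c may do this bookkeeping differently); one matrix vector product of dimension n is counted as 2*n*n floating point operations (n*n multiplies plus n*n adds).

/* Hypothetical helper: convert two elapsed-time samples (seconds and
   microseconds) and the repeat count into time per product and MFLOPS. */
static double mflops_from_timers(long isec0, long iusec0,
                                 long isec1, long iusec1,
                                 int repeat, int n,
                                 double *time_per_product){
    double elapsed = (double)(isec1 - isec0)
                   + (double)(iusec1 - iusec0) * 1.0e-6;   /* total seconds */
    *time_per_product = elapsed / (double)repeat;           /* one product  */
    return 2.0 * (double)n * (double)n
           / (*time_per_product * 1.0e6);                   /* MFLOPS       */
}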
----------------------------------
Version Comment
----------------------------------
0 - dot product not unrolled
1 - daxpy not unrolled
----------------------------------
Questions
- Run the code as is for n=1000 and complete the following table
----------------------------------------------------------------
The machine I'm using is xe.nci.org.au
Transpose Version Time(sec) MFLOPS
----------------------------------------------------------------
0 0
0 1
1 0
1 1
----------------------------------------------------------------
- Run the code as is for n=10000 and complete the following table
----------------------------------------------------------------
The machine I'm using is xe.nci.org.au
Transpose Version Time(sec) MFLOPS
----------------------------------------------------------------
0 0
0 1
1 0
1 1
----------------------------------------------------------------
You should derive the MFLOPS from the recorded times. Explain how you did
this. Provide some detailed comments on the relative performance of
the different algorithms.
Questions - Analyze Unrolled Code
Augment the code to include your unrolled dot product and (COMP6464
only) daxpy matrix vector products. Treat these as versions 2 and 3
respectively.
- Now augment your table with performance data for the two unrolled
routines for n=1000.
----------------------------------------------------------------
The machine I'm using is xe.nci.org.au
Transpose Version Time(sec) MFLOPS
----------------------------------------------------------------
0 2
0 3 (COMP6464 ONLY)
1 2
1 3 (COMP6464 ONLY)
----------------------------------------------------------------
- Now augment your table with performance data for the two unrolled
routines for n=10000.
----------------------------------------------------------------
The machine I'm using is xe.nci.org.au
Transpose Version Time(sec) MFLOPS
----------------------------------------------------------------
0 2
0 3 (COMP6464 ONLY)
1 2
1 3 (COMP6464 ONLY)
----------------------------------------------------------------
Comment on the performance of these routines in comparison to the
other versions.
COMP6464 Only
- COMP6464 only: Use the PAPI library to measure the flops of the different versions and verify your calculation against the measurements. You can use the template from lab2.
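As a starting point, here is a minimal sketch using PAPI's high-level PAPI_flops call (this is only an assumption about how you might wire it up; the lab2 template may instead use explicit event sets, and the measured region below is left as a placeholder):

#include <stdio.h>
#include <papi.h>

int main(void){
    float rtime, ptime, mflops;
    long long flpops;

    /* First call initialises and starts the floating point counters. */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK){
        fprintf(stderr, "PAPI_flops failed\n");
        return 1;
    }

    /* ... run the matrix vector product(s) to be measured here ... */

    /* Second call returns real time, process time, the floating point
       operations counted since the first call, and the resulting MFLOPS. */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK){
        fprintf(stderr, "PAPI_flops failed\n");
        return 1;
    }
    printf("flops = %lld, MFLOPS = %f\n", flpops, mflops);
    return 0;
}

Compare the measured flop count with the 2*n*n per product you used when deriving MFLOPS from the elapsed times.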