ANU College of Engineering and Computer Science
Research School of Computer Science


High Performance Scientific Computing

COMP3320/6464 2012: Laboratory 3

Loops and Memory Access Patterns

The aim of this lab is to investigate various aspects of loops and their performance.

YOU MUST LOGON TO xe.nci.org.au TO DO THIS LAB

Start by logging on to xe.nci.org.au and mapping its directory. If you don't know how, take a look at lab2. Copy lab3.tar.gz onto xe and unpack it.
Set up your environment by typing:

  • module load papi
  • module load gcc/4.4.4
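
Putting the setup steps together, a typical session looks roughly like the following (the path to lab3.tar.gz and the name of the unpacked directory are placeholders - lab2 describes where the material actually lives):

	ssh your_username@xe.nci.org.au
	cp <path-to-lab-material>/lab3.tar.gz .      # placeholder path - see lab2
	tar -xzf lab3.tar.gz
	cd lab3                                      # the unpacked directory name may differ
	module load papi
	module load gcc/4.4.4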


Lab Assessment

You are not required to submit labs, but all lab material is examinable! You are therefore advised to record your answers to each question.

If you copied the files as above you will have a template answer file called:

COMP3320_LAB3.txt

Where relevant, write answers in plain text in this file. At the end of the lab you should have created the following files:
  • COMP3320_LAB3.txt: text answers to questions below
  • matvec0.c: matrix vector product including unrolled dot product algorithms
  • matvec1.c: COMP6464 only - matrix vector product including unrolled daxpy-based algorithms
  • matvec2.c: benchmark code including modifications to test the unrolled dot product and (COMP6464) daxpy algorithms
  • myvec.c: unrolled dot product and (COMP6464) daxpy algorithms

Assembly, Aliasing and Dependency

Objective - to clearly see compiler-generated loop unrolling in the assembly code, and to understand what is meant by aliasing and its implications.

File vadd.c contains the following code:

     void vadd(int n, double * __restrict__ a, double * __restrict__ b){
       int k;
       for (k = 0; k < n; k++){
          a[k] = a[k] + b[k];
       }
     }
that sums two vectors together (b is added into a). The keyword __restrict__ instructs the compiler to assume that there is no pointer aliasing between these two parameters. This enables the compiler to generate more optimised code; however, it can lead to errors when a and b do in fact alias (the arrays overlap, eg. a == b).
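
To make the idea of aliasing concrete, here is a minimal sketch (not one of the lab files) that calls vadd once with disjoint array sections and once with overlapping ones; only the second call violates the __restrict__ promise:

     /* Sketch only: link against vadd.c.  The second call passes overlapping
      * regions of the same array, which is exactly the aliasing situation
      * that __restrict__ tells the compiler will never happen. */
     #include <stdio.h>

     void vadd(int n, double * __restrict__ a, double * __restrict__ b);

     int main(void){
       int i;
       double v[8] = {1,1,1,1,1,1,1,1};

       vadd(4, &v[0], &v[4]);   /* no overlap: a and b refer to disjoint memory */
       vadd(4, &v[0], &v[1]);   /* overlap: a[k+1] and b[k] are the same element */
       for (i = 0; i < 8; i++) printf("%f\n", v[i]);
       return 0;
     }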

Peak Performance

To execute one iteration of the loop, we load two values from memory, add them together and store the result back to memory. This loop can achieve a peak performance of one loop iteration every 4 clock cycles, assuming loop unrolling and instruction rearrangement.
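
As a rough worked example (the 2.0 GHz clock rate below is purely illustrative and is not necessarily the xe.nci.org.au clock rate), one floating point add per iteration at one iteration every 4 cycles corresponds to:

     flops per cycle = 1 add / 4 cycles              = 0.25
     assumed clock   = 2.0 GHz (illustrative only)
     peak for vadd   = 0.25 * 2.0e9 flops/s          = 500 MFLOPS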

Generating Assembly Code from C

We are going to have a look at the x86 64-bit assembly. We generate this using the -S compiler flag. Use the following command:

	gcc -O2 -funroll-loops -S vadd.c
or
        make vadd.s
The assembly is placed in a file with a .s postfix, ie vadd.s.

Compiler options

Note that the -funroll-loops flag tells the compiler to unroll the inner loop. We did not use -O3 as this would also enable SIMD SSE instructions.

Understanding Assembly

Inspect the assembly. The main part of the for loop should be indicated by references to the label .L3; above it you can see the loop precondition. You don't need to know a huge amount of x86 assembly. Note that x86 is a CISC architecture, so arithmetic instructions can work not only on registers but also directly on memory. Here are some basic instructions:
  movsd  - move 8B from memory to a register or from a register to memory
  addsd  - floating point add
  addq   - integer add
  cmpq   - integer comparison 
  jne    - jump if not equal

Registers are divided into groups

  xmm0 - xmm15  - floating point (SSE) registers (up to 128 bit)
  rax           - 64-bit register (64-bit version of the x86 accumulator)
  rdx           - 64-bit register (64-bit version of the x86 data register)
  rsi           - 64-bit register (64-bit version of the x86 source index register)
Look at an example of one loop iteration:
  movsd	(%rsi,%rax), %xmm12 // rax is the loop counter (k), rsi contains the address of a
			    // we load the k-th element of a into register xmm12
  addsd	(%rdx,%rax), %xmm12 // rax is the loop counter (k), rdx contains the address of b
			    // we add the element b[k], sitting in memory, to register xmm12
			    // and store the result in register xmm12
  movsd	%xmm12, (%rsi,%rax) // we save the contents of register xmm12 into a[k]
  addq	$8, %rax            // we increment the loop counter by 8B (the size of a double)

Questions

  1. From the assembly, cut out all the lines that form the main body of the loop, that is, all the assembly instructions that are repeated each time the .L3 loop is executed.
  2. The compiler has automatically unrolled the loop. How many times has it been unrolled? Justify your answer.
We are now going to run vadd.c. The calling program is runadd.c. Type
	make runadd1
This will produce the runadd1 executable that can be run by typing
	runadd1 number
where number is an integer less than 500. The program allocates an array (a), initialises the elements to 1.0 and then calls vadd. The values of the array are printed before and after calling vadd.
  1. Convince yourself that the program is working correctly, ie. for small n work out results on paper by hand and make sure you agree with the program.
Now type
	make runadd2
This will use exactly the same source code to produce the runadd2 executable.
  1. Run runadd2 with input of 5, 10, 15, 20, and 25.

    You should find the "final a" output is correct for input values of 5 (and indeed up to 8), but wrong for input values of 10 onwards. (Note: if this were a real-life application it would appear to work fine for small cases but give big errors for large ones, highlighting the need to do extensive testing on real codes!) However, the results for runadd1 are always correct.

    Give a detailed explanation as to what is happening and why runadd2 gives errors (but only in some cases).

Now, we will investigate the performance of both variants. Type
    make runadd1_perf
    make runadd2_perf
    
    ./runadd1_perf 1000 1000000
    ./runadd2_perf 1000 1000000
This will run the performance benchmark with an array size of 1000; the benchmark is repeated 1,000,000 times.
  1. You should clearly see the performance difference. Write down the average MFLOPS of both versions and explain how this difference is possible.

Loop Unrolling: Matrix Vector Product

Objective - using a matrix vector product as an example, learn how to unroll outer loops and then measure the performance change.

The programs matvecX.c compute a matrix vector product, Ab = x, for a square matrix A of dimension n and a vector b, summing the result into another vector x. The programs have two options, depending on whether we want to use the matrix A or the transpose of A.

Dot Product Formulation in matvec0.c

Program matvec0.c uses a dot product based formulation, with the two options equating to the following:
NORMAL Ab = x
  for (i=0; i < n; i++){
    sum=0.0;
    for (j=0; j < n; j++){
      sum=sum+a[i][j]*b[j];
    }
    x[i]=x[i]+sum;
  }

TRANSPOSED ATb = x
  for (i=0; i < n; i++){
    sum=0.0;
    for (j=0; j < n; j++){
      sum=sum+a[j][i]*b[j];
    }
    x[i]=x[i]+sum;
  }
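
The pseudo-code above uses two-dimensional indexing a[i][j]; the actual lab programs appear to store A as a flat one-dimensional array (the mydot calls in the unrolling section below index it as a[i*n]). Purely as an illustration of the NORMAL formulation under that flat-storage assumption, a self-contained sketch might look like:

  /* Sketch only: assumes A is stored row-major in a flat array of length n*n,
   * so element A(i,j) lives at a[i*n + j].  This illustrates the formulation
   * above; it is not the code from matvec0.c. */
  void matvec_dot_normal(int n, const double *a, const double *b, double *x){
    int i, j;
    for (i = 0; i < n; i++){
      double sum = 0.0;
      for (j = 0; j < n; j++){
        sum = sum + a[i*n + j] * b[j];   /* dot product of row i of A with b */
      }
      x[i] = x[i] + sum;                 /* accumulate into x, as in the lab code */
    }
  }

The TRANSPOSED variant differs only in indexing a[j*n + i], which walks through memory with stride n rather than 1 - something worth remembering when you look at the performance results later.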

Compilation

Compile the program by typing
	make matvec0
The program takes as input the integer dimension of the matrix, eg
	matvec0 3

Understanding the Program

Inspect the program, compile it and run it for a small integer input value, eg 3. The program will print out the matrix, the vectors and the results; make sure that the program is working as you expect, ie. multiply the matrix and vector by hand on a piece of paper and verify that the program is indeed giving the correct results. The program should be correct (ie. this is NOT a trick or test!); the aim here is to confirm that you understand how to perform a matrix vector product. If you have any problems ask your tutor.

Unrolling the Matrix Vector Product

The program currently stops after experiment 2 is complete. When you are happy that you understand the program, change the #define value of STOP to 0. The program will now attempt to run experiments 3 and 4.

In experiments 3 and 4 we want to use an unrolled matrix vector product, in which the outer loop (loop i in the above code segment) has been unrolled by a factor of 2. The code for experiment 3 is half complete, while all the code for experiment 4 needs to be written.

Experiment 3 uses the code segment:

         printf("\n ------unrolled dot normal------\n");
        for (i=0; i < n/2; i++){
          mydot(&x[2*i],n,&a[2*i*n],1,b,1);
          mydot(&x[2*i+1],n,&a[(2*i+1)*n],1,b,1);
  //          mydot_u2();
        }
        if (n%2 == 1)mydot(&x[n-1],n,&a[(n-1)*n],1,b,1);
      }
This code gives the correct answers, but to complete the unrolling process we must combine the two mydot calls into one new routine, labelled mydot_u2. (A commented-out call to mydot_u2 is shown in the code above.)
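
For reference, the calls above suggest that mydot has a BLAS ddot-like interface: the dot product of two strided vectors of length n is accumulated into the location given by the first argument. The real routine is supplied with the lab (see myvec.c); the sketch below is only an inference from the call sites, so check it against the actual source, in particular whether it accumulates into or overwrites its first argument.

  /* Assumed interface, inferred from the calls mydot(&x[i], n, &a[i*n], 1, b, 1):
   *   res            - location into which the dot product is accumulated (assumption)
   *   n              - vector length
   *   v/incv, w/incw - the two vectors and their strides
   * Check myvec.c for the real definition - this is only a sketch. */
  void mydot(double *res, int n, const double *v, int incv,
             const double *w, int incw){
    int j;
    double sum = 0.0;
    for (j = 0; j < n; j++){
      sum = sum + v[j*incv] * w[j*incw];
    }
    *res = *res + sum;
  }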

Questions

  1. Write the code for mydot_u2. Verify that it works correctly. If you have any problems ask your tutor.
  2. Write the code to make experiment 4 work correctly. You should use the same mydot_u2 code, but this time you need to write the outer loops that call this routine, ie. the equivalent of the small code section immediately above.

    Again verify that it works correctly and ask your tutor if you have any problems. It is important you get this correct.

Daxpy Formulation in matvec1.c (Constant times a vector plus a vector)

Program matvec1.c is analogous to matvec0.c except that the matrix vector product is now based on a daxpy formulation, ie.
NORMAL Ab = x
  for (i=0; i < n; i++){
    for (j=0; j < n; j++){
      x[j]=x[j]+a[j][i]*b[i];
    }
  }

TRANSPOSED ATb = x
  for (i=0; i < n; i++){
    for (j=0; j < n; j++){
      x[j]=x[j]+a[i][j]*b[i];
    }
  }
As before, the code works using normal (not unrolled) daxpy operations. Verify that the code works correctly and that you understand what it is doing.
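
For reference, a daxpy operation computes y = alpha*x + y for a scalar alpha and vectors x and y; the inner j loop above is exactly a daxpy with alpha = b[i] and one column (NORMAL) or one row (TRANSPOSED) of A as the scaled vector. A minimal sketch of such a kernel (an illustration only; the routine actually used by the lab, in myvec.c, may differ in name and argument order):

  /* Sketch of a daxpy-style kernel: y[j*incy] += alpha * v[j*incv].
   * Illustrative only - check myvec.c for the routine the lab really uses. */
  void mydaxpy(int n, double alpha, const double *v, int incv,
               double *y, int incy){
    int j;
    for (j = 0; j < n; j++){
      y[j*incy] = y[j*incy] + alpha * v[j*incv];
    }
  }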

COMP6464 Only

  1. Write code to enable experiment 7 to work in an analogous fashion to experiment 5, but with two way outer loop unrolling.
  2. Write code to enable experiment 8 to work in an analogous fashion to experiment 6, but with two way outer loop unrolling.

Performance Evaluation (COMP3320 and COMP6464)

In program matvec2.c we will benchmark the performance of the dot product and daxpy based matrix vector algorithms demonstrated in matvec0.c and matvec1.c. This time we will use much larger matrices and no longer print them out.

Compilation

Compile the code by typing
        make matvec2
As before, to run the code, give on the command line the dimension of the matrix to be used, e.g.
	matvec2 1000

Implementation Details

The program loops over normal and transposed versions of the matrices and also over all the different algorithm versions that are implemented. This is achieved by the statements:
  for (trans=0; trans < 2; trans++){
     for (vers=0; vers < NVERSIONS; vers++){
The benchmark is run and the elapsed time measured by the lines:
     elp_time(&isec0,&iusec0);
     for (i=0; i < repeat;i++)matvecmult(vers,trans,nsize,a,b,c);
     elp_time(&isec1,&iusec1);
where the value of repeat is increased until the benchmark takes longer than 500 usec to complete (ie. well above the resolution of the elapsed time clock). Since the same calculation is repeated several times with different algorithms (values of vers), we check that the results produced by successive experiments agree (to within some threshold). If all looks okay we print out details of the benchmark algorithm (the values of trans and vers), the dimension of the problem, and the time taken for one matrix vector product. We then move on to the next algorithm.
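
To make the timing arithmetic explicit, here is a small sketch of how the time per product can be derived from the two elp_time samples, assuming (as the variable names suggest) that elp_time reports whole seconds and microseconds - check the lab source for the actual convention:

  /* Sketch: convert the two elp_time samples into the time for ONE matrix
   * vector product.  The (seconds, microseconds) meaning of elp_time's
   * outputs is an assumption; variable names mirror the benchmark loop above. */
  double time_per_product(int isec0, int iusec0, int isec1, int iusec1, int repeat){
    double elapsed = (double)(isec1 - isec0) + (double)(iusec1 - iusec0) * 1.0e-6;
    return elapsed / (double)repeat;
  }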

             ----------------------------------
             Version   Comment
             ----------------------------------
                0   -  dot product not unrolled
                1   -  daxpy not unrolled
             ----------------------------------

Questions

  1. Run the code as is for n=1000 and complete the following table
        ----------------------------------------------------------------
                    The machine I'm using is xe.nci.org.au
        Transpose  Version   Time(sec)       MFLOPS
        ----------------------------------------------------------------
            0         0      
            0         1      
            1         0      
            1         1      
       ----------------------------------------------------------------
    
        Run the code as is for n=10000 and complete the following table
        
        ----------------------------------------------------------------
                    The machine I'm using is xe.nci.org.au
        Transpose  Version   Time(sec)       MFLOPS
        ----------------------------------------------------------------
            0         0      
            0         1      
            1         0      
            1         1      
       ----------------------------------------------------------------
       
    You should derive the MFLOPS from the recorded times. Explain how you did this. Provide some detailed comments on the relative performance of the different algorithms.

Questions - Analyze Unrolled Code

Augment the code to include your unrolled dot product and (COMP6464 only) daxpy matrix vector products. Treat these as versions 2 and 3 respectively.
  1. Now augment your table with performance data for the two unrolled routines for n=1000.
        ----------------------------------------------------------------
                   The machine I'm using is xe.nci.org.au
        Transpose  Version   Time(sec)       MFLOPS
        ----------------------------------------------------------------
            0         2      
            0         3      (COMP6464 ONLY)
            1         2      
            1         3      (COMP6464 ONLY)
       ----------------------------------------------------------------
    
        Now augment your table with performance data for the two unrolled
        routines for n=10000.
        
        ----------------------------------------------------------------
                   The machine I'm using is xe.nci.org.au
        Transpose  Version   Time(sec)       MFLOPS
        ----------------------------------------------------------------
            0         2      
            0         3      (COMP6464 ONLY)
            1         2      
            1         3      (COMP6464 ONLY)
       ----------------------------------------------------------------
        
    Comment on the performance of these routines in comparison to the other versions.

COMP6464 Only

  1. Use the PAPI library to measure the flops of the different versions and verify your calculations against the measurements. You can use the template from lab2.
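
If the lab2 template is not to hand, one possible approach is PAPI's high-level interface. The sketch below assumes that the PAPI version loaded by module load papi provides the PAPI_flops call and that matvecmult has the signature used in the benchmark loop above; treat it as a starting point only and prefer the lab2 template.

  /* Sketch (COMP6464): measuring flops with PAPI's high-level interface.
   * The matvecmult prototype is an assumption based on the benchmark loop;
   * link with -lpapi.  Prefer the approach shown in the lab2 template. */
  #include <stdio.h>
  #include <papi.h>

  extern void matvecmult(int vers, int trans, int nsize, double *a, double *b, double *c);

  void measure(int vers, int trans, int nsize, double *a, double *b, double *c){
    float rtime, ptime, mflops;
    long long flpops;

    /* First call starts/zeroes the counters ... */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK){
      fprintf(stderr, "PAPI_flops failed\n");
      return;
    }
    matvecmult(vers, trans, nsize, a, b, c);
    /* ... second call reads them: flpops is the floating point operation
     * count since the first call and mflops the measured rate. */
    PAPI_flops(&rtime, &ptime, &flpops, &mflops);
    printf("vers=%d trans=%d  flops=%lld  MFLOPS=%.1f\n", vers, trans, flpops, mflops);
  }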