Skip Navigation | ANU Home | Search ANU | Search FEIT | Feedback
The Australian National University
Faculty of Engineering and Information Technology (FEIT)
Department of Computer Science
Printer Friendly Version of this Document
High Performance Scientific Computing COMP3320

COMP3320/6464 2006: Laboratory 3

Loops and Memory Access Patterns

The aim of this lab is to investigate various aspects of loops and their performance

YOU MUST LOGON TO IWAKI TO DO THIS LAB

Start by creating a directory and copying all the relevant programs into that directory:

  mkdir lab3
  cd lab3
  cp -r /dept/dcs/comp3320/public/lab3/* .


Lab Assessment

You are not required to submit labs, but all lab material is examinable! You are therefore advised to record you answers to each question.

If you copied the files as above you will have a template answer file called:

COMP3320_LAB3.txt

Where relevant write answers in plain text in this file. At the end of the lab you should have created the following files
  • COMP3320_LAB3.txt: text answers to questions below
  • matvec0.c: matrix vector product included unrolled dot product algorithms
  • matvec1.c: COMP6464 only - matrix vector product included unrolled daxpy based algorithms
  • matvec2.c: benchmark code including modifications to test unrolled dot product and (comp6464) daxpy algorithms
  • myvec.c: unrolled dot product and (comp6464) daxpy algorithms


Assembly, Aliasing and Dependency

Objective - to clearly see compiler generated loop unrolling in the assembly code and to understand what is meant by aliasing and its implications

File vadd.c contains the following code:
     void vadd(int n, double a[], double b[]){
       int k;
       for (k = 0; k < n; k++){
          a[k] = a[k] + b[k];
       }
     }
that sums together two vectors. You should now appreciate that on the SPARC architecture and under the right conditions this loop can achieve a peak performance of 1 loop iteration every 3 clock cycles. (If you are unclear about this talk to your tutor,i.e. me!)

We are going to have a look at the SPARC assembly. We generate this using the -S compiler flag. Use the following command:

	cc -fast -xarch=v8plusa -xprefetch=no -S vadd.c
or
        make vadd.s
The assembly is placed in a file with a .s postfix, ie vadd.s. Note the -xprefetch=no command tells the compiler not to generate prefetch instructions. Prefetches are like early loads that aim to bring data from memory to cache in advance of needing the data. They take up a load/store instruction slot - and in this case complicate the assembly. (At a later stage you can try removing this line and comparing the assembly.)

Inspect the assembly. The main part of the for loop should be indicated by references to the label L900000107. You don't need to know a huge amount of SPARC assembly, but here are some basic commands:

  ldd    - load double (8 bytes)
  std    - store double (8 bytes)
  add    - integer add
  cmp    - integer comparison
  fadd   - floating point add
  ble,pt - branch if less or equal to zero to specified label. Note the
           ,pt means branch is predicted to be taken, pn is predicted 
           not taken
  1. From the assembly cut out all those lines that form the main body of the loop. This is all assembly instructions that are repeated as the L900000107 loop is repeated. (Remember SPARC assembly has a branch delay slot)
  2. The compiler has automatically unrolled the loop. How many times has it been unrolled. Justify your answer.
  3. Using the grouping rules for the SPARC architecture, and assuming that data is loaded from level 1 cache (2 cycle latency), how many cycles will it take for 1 iteration of this unrolled loop to complete. There is one piece of extra information you require. Store operations on the UltraSPARC can actually be issued immediately after the instruction that creates the data to be stored. This is because the store operation goes to a store queue that waits for the data to arrive in the correct register. Thus in one cycle we can do fadd %f1,%f2,%f3 and then in the next cycle we can immediately issue a std %f3,[int_address].
We are now going to run vadd.c. The calling program is runadd.c. Type
	make runadd1
This will produce the runadd1 executable that can be run by typing
	runadd1 number
where number is an integer less than 500. The program allocates an array (a), initializes the elements to 1.0 and then calls vadd. The values of the array are printed before and after calling vadd.
  1. Look at the code. Write down an expression in terms of a[i], a[i+n], a[i-n] etc to indicated what the program is doing. (Convince yourself that the program is working correctly, ie. for small n work out results on paper by hand and make sure you agree with the program).
Now type
	make runadd2
this will use exactly the same source code to produce the runadd2 executable.
  1. Run runadd2 with input of 5, 10, 15, 20, and 25. You should find the "final a" output is correct for input values of 5, 10, and 15 but wrong for input values of 20 and 25. (Note: if this were a real life application it would appear to work fine for small cases, but give big errors for large! Highlighting the need to do extensive testing on real codes!) However the results for runadd1 are always correct. Give a detailed explanation as to what is happening and why runadd2 gives errors (but only in some cases).
As an aside - type the following
      make vadd-info
this uses the -Wc,-Qms_pipe+info compiler option to give you some useful information about the loops in the vadd.c file and whether or not they were pipelined.

Loop Unrolling: Matrix Vector Product

Objective - using a matrix vector product as an example learn how to unroll outer loops and then to measure the performance change

Programs matvecX.c compute a matrix vector product for a square matrix (a) of dimension n with a vector b, summing the result into another vector x. The programs have two options depending on whether we want to use matrix a or the transpose of a.

Dot Product Formulation

Programs matvec0.c uses a dot product based formulation with the two options equating to the following:
NORMAL
  for (i=0; i < n; i++){
    sum=0.0
    for (j=0; j < n; j++){
      sum=sum+a[i][j]*b[j];
    }
    x[i]=x[i]+sum;
  }

TRANSPOSED
  for (i=0; i < n; i++){
    sum=0.0
    for (j=0; j < n; j++){
      sum=sum+a[j][i]*b[j];
    }
    x[i]=x[i]+sum;
  }
Compile the program by typing
	make matvec0
The program takes as input the integer dimension of the matrix, eg
	matvec0 3
Inspect the program, compile it and run it for a small integer input value, eg 3. The program will print out the matrices and results, make sure that the program is working as you expect, ie. multiply the matrix and vector by hand on a piece of paper and verify the program is indeed giving the correct results. The program should be correct (ie. this is NOT a trick or test!), the aim here is to confirm that you understand how to perform a matrix vector product. If you have any problems ask your tutor.

The program currently stops after experiment 2 is complete. When you are happy that you understand the program change the #define value of STOP to be 0. The program will now attempt to run experiments 3 and 4.

In experiments 3 and 4 we want to use an unrolled matrix vector product, in which the outer loop (loop i in the above code segment) has been unrolled by a factor of 2. Code for experiment 3 is half complete, while all the code for experiment 4 needs to be written

Experiment 3 uses the code segment:

         printf("\n ------unrolled dot normal------\n");
        for (i=0; i < n/2; i++){
          mydot(&x[2*i],n,&a[2*i*n],1,b,1);
          mydot(&x[2*i+1],n,&a[(2*i+1)*n],1,b,1);
  //          mydot_u2();
        }
        if (n%2 == 1)mydot(&x[n-1],n,&a[(n-1)*n],1,b,1);
      }
This code gives the correct answers, but to complete the unrolling process we must replace the first two mydot calls by a single new routine - labelled mydot_u2. (A commented out call to mydot_u2 is shown in the code above.

  1. Write the code for mydot_u2. Verify that it works correctly. If you have any problems ask your tutor.
  2. Write the code to make experiment 4 work correctly. You should use the same mydot_u2 code, but this time you need to write the outer loops that call this routine, ie. the equivalent of the small code section immediately above. Again verify that it works correctly and ask your tutor if you have any problems. It is important you get this correct.

Daxpy Formulation

Program matvec1.c is analogous to matvec0.c except that the matrix vector product is now based on a daxpy formulation, ie.
NORMAL
  for (i=0; i < n; i++){
    for (j=0; j < n; j++){
      x[j]=x[j]+a[j][i]*b[i];
    }
  }

TRANSPOSED
  for (i=0; i < n; i++){
    for (j=0; j < n; j++){
      x[j]=x[j]+a[i][j]*b[i];
    }
  }
As before the code works using normal daxpy operations. Verify that the code works correctly and you understand what it is doing

COMP6464 Only

  1. Write code to enable experiment 7 to work in an analogous fashion to experiment 5, but with two way outer loop unrolling.
  2. Write code to enable experiment 8 to work in an analogous fashion to experiment 6, but with two way outer loop unrolling.

Performance Evaluation (COMP3320 and COMP6464)

In program matvec2.c we will benchmark the performance of the dot product and daxpy based matrix vector algorithms demonstrated in matvec0.c and matvec1.c. This time we will use much larger matrices and no longer print them out. Compile the code by typing
        make matvec2
As before, to run the code, give on the command line the dimension of the matrix to be used, e.g.
	matvec2 1000
The program loops over normal and transposed versions of the matrices and also over over all different versions that are implemented. This is achieved by the statements:
  for (trans=0; trans < 2; trans++){
     for (vers=0; vers < NVERSIONS; vers++){
The benchmark is run and the elapsed time measured by the lines:
     elp_time(&isec0,&iusec0);
     for (i=0; i < repeat;i++)matvecmult(vers,trans,nsize,a,b,c);
     elp_time(&isec1,&iusec1);
where the value of repeat is increased until the benchmark takes longer than 500usec to complete (ie. well above resolution of the the elapsed time clock). Since the same calculation is repeated several times with different algorithms (values of vers) we check that the results produced by succesive experiments agree (to within some threshold). If all looks okay we print out details of the benchmark algorithm (values of trans and vers), the dimension of the problem, and the time taken for one matrix vector product. We then move on to the next algorithm.

In addition to the dot product and daxpy schemes investigated above, we will also use the Sun Optimized Performance Library. The function we need is dgemv (d- double precision, ge - general, m - matrix, v - vector) and we access the library via the -xlic_lib=sunperf compiler flag and the sunperf.h header file. As it stands the program uses three different values of vers corresponding to the following:

             ----------------------------------
             Version   Comment
             ----------------------------------
                0   -  sunperf dgemv
                1   -  dot product not unrolled
                2   -  daxpy not unrolled
             ----------------------------------
  1. Run the code as is for n=1000 and complete the following table
        ----------------------------------------------------------------
                    The machine i'm using is IWAKI
        Transpose  Version   Time(sec)       MFLOPS
        ----------------------------------------------------------------
            0         0      
            0         1      
            0         2      
            1         0      
            1         1      
            1         2      
       ----------------------------------------------------------------
       
    You should derive the MFLOPS from the recorded times. Explain how you did this. Provide some detailed comments on the relative performance of the different algorithms.
Augment the code to include your unrolled dot product and (comp6464 only) daxpy matrix vector products. Treat these as version 3 and 4 respectively.
  1. Now augment your table with performance data for the two unrolled routines for n=1000.
        ----------------------------------------------------------------
                   The machine i'm using is ??????????
        Transpose  Version   Time(sec)       MFLOPS
        ----------------------------------------------------------------
            0         3      
            0         4      (COMP6464 ONLY)
            1         3      
            1         4      (COMP6464 ONLY)
       ----------------------------------------------------------------
        
    Comment on the performance of these routines in comparison to the other versions
  1. Finally matvec3.c is identical to matvec2.c, except that the dot product and daxpy routines that were in myvec.c are now explicitely included in the matvec3.c file. Compile and run the code. You should now find that the dot product and daxpy times are similar (related). Explain what has happened.