Skip Navigation | ANU Home | Search ANU | Search FEIT | Feedback
The Australian National University
Faculty of Engineering and Information Technology (FEIT)
Department of Computer Science
Printer Friendly Version of this Document
High Performance Scientific Computing COMP3320

COMP3320/6464 2008: Laboratory 2

Hardware Architecture and Code Performance

The aim of this lab is to understand how hardware architecture can dramatically affect the performance of your code.

YOU MUST LOGON TO IWAKI TO DO THIS LAB

Start by creating a directory and copying all the relevant programs into that directory:

  mkdir lab2
  cd lab2
  cp -r /dept/dcs/comp3320/public/lab2/* .


Lab Assessment

You are not required to submit labs, but all lab material is examinable! You are therefore advised to record you answers to each question.

If you copied the files as above you will have a template answer file called:

COMP3320_LAB2.txt

Where relevant write answers in plain text in this file.
  • COMP3320_LAB2.txt: text answers to questions below


Hardware Performance Counters

Objective - to demonstrate the use of hardware performance counters to gain very accurate performance information

As discussed in lectures, the UltraSPARC CPU contain two hardware performance counter registers - pic0 and pic1. Exactly what events can be counted differs between UltraSPARC hardware. To find out what can be counted on Iwaki do "cpustat -h"

  1. What machine (use hostname) are your running on? If it is not iwaki log out and log on to iwaki!
  2. What is its clock rate (use /usr/local/bin/fpversion)?
As well as giving the processor speed, fpversion also gives the cache characteristics of the machine you are using in the form -xcache=16/32/1:1024/64/1. To interpret this do man cc and look for xcache.
  1. For BOTH the L1 and L2 caches what are their i) sizes, ii) cache line sizes, iii) associativity.
Program pcount repeatedly executes the following for loop
/* Begin Critical Loop */
    if (cpc_take_sample(&before) == -1) exit(-1);
    for (k = 0; k < nsize-1; k++){
         sum = sum + a[k]*b[k];
    }
    if (cpc_take_sample(&after) == -1) exit(-1);
/* End Critical Loop */
interrogating the hardware performance counters immediately before and after the for loop. Look at the code and make sure you understand what it is doing.

Compile pcount.c by typing make pcount. Run the code by typing pcount nsize nrpt where nsize and nrpt are explained below.

  1. The UltraSPARC III processor can initiate one floating point multiplication and one floating point addition each clock cycle. Given this how many cycles might you expect the above loop to take?
  2. If the line
             sum = sum + a[k]*b[k];
        
    were replaced by the line
             sum = sum + a[k]*(a[k]+b[k]);
        
    how many cycles would now be required to complete 1 iteration?
We are now going to use hardware performance counters to measure the performance of the sum=sum+a[k]*b[k] loop as a function of vector length. Program pcount takes as command line input two parameters, nsize and nrpt.
nsize is the length of the vectors a and b.
nrpt is the number of times we will repeat the whole timing exercise. (Look at code to ensure you understand this).
  1. Complete the following table.
    To gather data quickly you may find this command useful!
    iwaki> repeat 3 pcount 10 1000000 | grep Meas | 
           nawk '{if (NR%2 == 1){xx=$3}else{print xx " " xx/10000000 " " $3/xx}}'
    ...assuming you know awk!    
       
          ------------------------------------------------------
          nsize nrpt    Cycles     Cycle/iter   Inst/cycle
          ------------------------------------------------------
          10 1000000 
          10 1000000 
          10 1000000 
          
          100 100000 
          100 100000 
          100 100000 
          
          1000 10000 
          1000 10000 
          1000 10000 
          
          10000 1000 
          10000 1000 
          10000 1000 
          
          100000 100 
          100000 100 
          100000 100 
          
          1000000 10 
          1000000 10 
          1000000 10 
          ------------------------------------------------------
    
    Note that since the product of nsize*nrpt constant the number of times the line sum=sum+a[k]*b[k] is executed is constant. Record three results for each pair of nsize and nrpt values. In the table Cycle/iter is the number of cycles required to execute the line sum=sum+a[k]*b[k] and is computed by you. Inst/cycle is the number of instructions issued per cycle and is a measure of how well we are using the superscalar hardware. Again this quantity is computed by you. Use floating point numbers in your table where applicable.
  2. In the above table you should find that with increasing nsize the performance initially improves, but then gets worse. Explain why this is.
  3. Earlier you were asked to predict the best performance obtainable for this loop on the UltraSPARC III. What percentage of this theoretical peak is i) your fastest result ii) your slowest result? (You should find there is a big difference between the best and worse).
We now want to use hardware counters to study cache access characteristics. You can change the events to be calculated by setting the PERFEVENTS environment variable. If you look at the code you will see that in the absence of this environment variable it effectively defaults to:
    setenv PERFEVENTS pic0=Cycle_cnt,pic1=Instr_cnt
  1. By setting the relevant performance counters we can gather information on the level 1 data cache and level 2 or "ecache" hit and miss rate. State which performance counters you would use? (Recall from above that cpustat -h gives you a list of possible counters)
  2. Complete the following table:
    
    (Modify your shell/awk command from above)
        
          ------------------------------------------------------
                                      Level 1 D-Cache Reads
          nsize         nrpt         ref      misses        hits
          ------------------------------------------------------
              10     1000000
              10     1000000
              10     1000000
             100      100000
             100      100000
             100      100000
            1000       10000
            1000       10000
            1000       10000
           10000        1000
           10000        1000
           10000        1000
          100000         100
          100000         100
          100000         100
         1000000          10
         1000000          10
         1000000          10
          ------------------------------------------------------
    
          ------------------------------------------------------
                                      Level 2 D-Cache Access
          nsize         nrpt         ref      misses        hits
          ------------------------------------------------------
              10     1000000
              10     1000000
              10     1000000
             100      100000
             100      100000
             100      100000
            1000       10000
            1000       10000
            1000       10000
           10000        1000
           10000        1000
           10000        1000
          100000         100
          100000         100
          100000         100
         1000000          10
         1000000          10
         1000000          10
          ------------------------------------------------------
    
    

Cache Line Size

Objective - to show how careful arrangement of data can lead to much better cache usage and improved performance

Program align.c defines an array of the following struct:
    struct vector {
     int number;
     double x0, y0, z0;
     double x1, y1, z1;
     struct vector *previous;
    } ll[NUM];
it then proceeds to measure the cycles required to perform the following operation
    for ( i = 0; i < NUM; i++) {
      sum += fabs(ll[i].x0 - ll[i].x1);
    }
repeating this loop LOOPS times and printing out the minimum and maximum number of cycles used. Look at the code and make sure you uderstand what it is doing.

Compile align.c by typing make align. Run the code by typing align. There is no input.

  1. Run the code several times. (Use the repeat command you used above). What is the minimum number of cycles required?
  2. The code contains the line
        #pragma align 32 (ll)
        
    what does this line do? (Guess!)
  3. Add code to printout the address of each of the following
           ll[k].number
           ll[k].x0  ll[k].y0  ll[k].z0
           ll[k].x1  ll[k].y1  ll[k].z1
           ll[k].previous
        
    for k=0,1,1023 and 1024. What are these addresses?
  4. Change the definition of vector to be
        struct vector {
         int number;
         double x0, x1,
         double y0, y1;
         double z0, z1;
         struct vector *previous;
        } ll[NUM];
    
    what is the minimum cycle time now? It should now be significantly less. Explain why this is?
  5. Now change the definition of vector to be
        struct vector {
         int number;
         struct vector *previous;
         double x0, x1,
         double y0, y1;
         double z0, z1;
        } ll[NUM];
    
    what is the minimum cycle time now? It should be larger - why?

Cache Associativity

Objective - to see how cache line associativity can directly impact observed performance

Similar to align.c program assoc.c defines an array of struct:
    struct {
     double x1, x2, x3, x4;
    } ll[4*NUM];
and then proceeds to measure the cycles required to perform the following operation
    for ( i = 0; i < NUM; i++) {
      ll[i].x1 = ll[i].x1 + ll[i+1*stride].x1 + ll[i+2*stride].x1
                  + ll[i+3*stride].x1 + ll[i+4*stride].x1;
      ll[i].x2 = ll[i].x2 + ll[i+1*stride].x2 + ll[i+2*stride].x2
                  + ll[i+3*stride].x2 + ll[i+4*stride].x2;
      ll[i].x3 = ll[i].x3 + ll[i+1*stride].x3 + ll[i+2*stride].x3
                  + ll[i+3*stride].x3 + ll[i+4*stride].x3;
      ll[i].x4 = ll[i].x4 + ll[i+1*stride].x4 + ll[i+2*stride].x4
                  + ll[i+3*stride].x4 + ll[i+4*stride].x4;
    }
repeating this operation LOOPS times and printing out the minimum and maximum number of cycles used. The value of stride (corrected to be between 1 and NUM-1) is read as command line input.

Compile assoc.c by typing make assoc. Run the code by typing assoc stride .

  1. Complete the following table:
          ---------------------------------
                    Minimum Cycles
          Stride  Exp1    Exp2    Exp3
          ---------------------------------
             1
             2
             3
          1024
          2048
          4096
          ---------------------------------
    
    
  2. You should find that the performance decreases markedly for the last 3 cases. Explain why this is the case?
  3. Perform additional measurements with stides close to 2048, eg stride=2046,2047,2049,2050... etc. When exactly does the performance change markedly from good to bad or vice versa?
  4. You might have expected the above transition only to occur for stride=2048 or multiples thereof - why? Why does it occur for 1024?
  5. Recode this loop to remove this problem. (HINT you want to persuade the compile to move all data from a cache line into registers before that cache line is evicted!)
If you have a little time...
  • Use performance counters in assoc.c to investigate level 1 cache activity and hopefully to confirm the reasons you have just given for the poor performance!

COMP6464 Only : Stream

Objective - investigate a widely used 3rd party benchmark for measuring memory bandwidth

Stream is a widely used memory bandwidth test. Go to the source code directory at the Stream web site and download the follow file:
   stream.c
  1. Read the header to the stream.c and try compiling the code accordingly. Now run the code on iwaki and partch, and complete the following table. What compiler options did you use on both machines?
                              MACHINE 1 - Iwaki
        -------------------------------------------------------------
        Function      Rate (MB/s)   RMS time     Min time     Max time
        Copy:      
        Scale:     
        Add:       
        Triad:     
                              MACHINE 2 - Partch
        -------------------------------------------------------------
        Function      Rate (MB/s)   RMS time     Min time     Max time
        Copy:      
        Scale:     
        Add:       
        Triad:     
    
  2. The Copy operation corresponds to:
           for (i = 0; i < LENGTH; i++) a[i] = b[i];
        
    write similar expressions for Scale, Add and Triad
  3. Given you knowledge of computer hardware would you expect Stream to report any difference in the MB/s rate for each of the above operations (Copy, Scale, Add, Triad)? Provide justification for you answer.
  4. Now recompile the code on Iwaki using the -fast option. Record new data (it should be better than before).
                              MACHINE 1 - Iwaki
        -------------------------------------------------------------
        Function      Rate (MB/s)   RMS time     Min time     Max time
        Copy:      
        Scale:     
        Add:       
        Triad:     
    
  5. Based on the MB/s results obtained using the -fast compiler flag compute MFLOP rates for the Add and Triad results assuming we are working with 8byte data quantities. Show how you compute these rates. How do these results compare with the theoretical peak?
  6. From the web site work out the following
    • The Pentium IV system that gives the fastest single processor performance. Compare this with your results on partch.
    • What stream preformance might you expect for the APAC national facility SGI system
    • The machine that gives the best single processor results