CECS Home | ANU Home | Search ANU
The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science
Printer Friendly Version of this Document

UniSAFE

High Performance Scientific Computing

COMP3320/6464 2008: Laboratory 2

Hardware Architecture and Code Performance

The aim of this lab is to understand how hardware architecture can dramatically affect the performance of your code.

ATTN: You must log on to wallaman to complete this lab

The machine wallaman.anu.edu.au is an 1.4 GHz UltraSPARC T2 with approximately 8 GB of physical memory and a 4 MB L2 cache (16-way set associative, 64 byte lines). It contains 8 physical cores, linked via a high bandwidth crossbar. Each physical core supports 8 hardware threads which are designed to fully exploit the two execution pipelines present on each core. More details about this machine can be found here.

Start by creating a directory for lab2 and copying all the content in the following tar file on to that directory: lab2.tar.gz


Lab Assessment

You are not required to submit labs, but all lab material is examinable! You are therefore advised to record you answers to each question.

If you copied the files as above you will have a template answer file called:

COMP3320_LAB2.txt

Where relevant write answers in plain text in this file.
  • COMP3320_LAB2.txt: text answers to questions below


Hardware Performance Counters

Objective - to demonstrate the use of hardware performance counters to gain very accurate performance information

As discussed in lectures, the UltraSPARC CPU contain two hardware performance counter registers - pic0 and pic1. Exactly what combination of events can be counted via pic0 and pic1 differs between UltraSPARC hardware. To find out what can be counted on wallaman type "/usr/sbin/cpustat -h" on ther terminal (see link for descriptions of individual counter events)

  1. What machine (use hostname) are your running on? If it is not wallaman log out and log on to wallaman!
  2. What is its clock rate (use /usr/bin/fpversion)?
As well as giving the processor speed, fpversion may also give the cache characteristics of the machine you are using in the form -xtarget=ultraT2 -xcache=8/16/4:4096/64/16. To interpret this do man cc and look for xcache.
  1. For BOTH the L1 and L2 caches what are their i) sizes, ii) cache line sizes, iii) associativity (go by the above xcache values if fpversion merely returns "-xcache=generic").
Program pcount repeatedly executes the following for loop
/* Begin Critical Loop */
    hwevts_begin();
 
    for (k = 0; k < nsize; k++){
        sum = sum + a[k]*b[k];
    }

    hwevts_end();

/* End Critical Loop */
interrogating the hardware performance counters immediately before and after the for loop. Look at the code and make sure you understand what it is doing. To greatly expedite the lab, we have provided you with a number of wrapper functions to libcpc (Sun's CPU performance counter library) in hwevts.c. Note that hwevts can also call PAPI if you are using linux (ie. ARCH_SPARC for Sun, ARCH_I686 for linux). Important functions include:
  • hwevts_init() - This function initializes performance counting, it should be called once at the beginning. By default, it initialises the first set of performance counters events specified in hwevts.c.
  • hwevts_setEvt(index) - Sets performance counter events to those defined at index.
  • hwevts_begin() - Begins counting of current performance counter events.
  • hwevts_end() - Ends count of current performance counter events.
  • hwevts_getCtrs(&ctr, &c) - Gets count of current counters (ctr), and the cycle count between the call to begin and end(c)
It is up to you to make sense of what hwevts does internally.

Note: You may also use Analyzer to help answer some of the questions in this lab. Remember to unhook hwevts from your code before doing so, as Analyzer also uses libcpc.

  1. The wrapper functions add overhead to the event counts. Devise a simple program which will roughly determine the extent of the overhead of hwevts_begin and hwevts_end. How does this define the way you will use these counters?
Compile pcount.c by typing gmake pcount. Run the code by typing pcount nsize nrpt where nsize and nrpt are explained below. You may need to use the full path to gmake which is /usr/sfw/bin/gmake.
  1. For one thread, the UltraSPARC T2 processor can initiate one floating point operation each clock cycle. Given this how many cycles might you expect the above loop to take?
  2. If the line
             sum = sum + a[k]*b[k];
        
    were replaced by the line
             sum = sum + a[k]*(a[k]+b[k]);
        
    how many cycles would now be required to complete 1 iteration?
We are now going to use hardware performance counters to measure the performance of the sum=sum+a[k]*b[k] loop as a function of vector length. The program pcount takes as command line input two parameters, nsize and nrpt.
nsize is the length of the vectors a and b.


nrpt is the number of times we will repeat the whole timing exercise. (Look at code to ensure you understand this).

  1. Complete the following table.
    To gather data quickly you may find this command useful!
    wallaman> repeat 3 pcount 10 1000000 | grep Meas | 
           nawk '{if (NR%2 == 1){xx=$3}else{print xx " " xx/10000000 " " $3/xx}}'
    ...assuming you know awk!    
       
          ------------------------------------------------------
          nsize nrpt    Cycles     Cycle/iter   Inst/cycle
          ------------------------------------------------------
          100 1000000 
          100 1000000 
          100 1000000 
          
          200 100000 
          200 100000 
          200 100000 
    
          400 100000 
          400 100000 
          400 100000 
          
          800 100000 
          800 100000 
          800 100000 
          
          1000 10000 
          1000 10000 
          1000 10000 
          
          10000 1000 
          10000 1000 
          10000 1000 
    
          100000 100 
          100000 100 
          100000 100 
          
          1000000 10 
          1000000 10 
          1000000 10 
          ------------------------------------------------------
    
    In the table Cycle/iter is the number of cycles required to execute the line sum=sum+a[k]*b[k] and is computed by you. Inst/cycle is the number of instructions issued per cycle and is a measure of how well we are using the superscalar hardware. Again this quantity is computed by you. Use floating point numbers in your table where applicable.
  2. In the above table you should find that with increasing nsize the performance initially improves, but then gets worse. Explain why this is.
  3. Earlier you were asked to predict the best performance obtainable for this loop on the UltraSPARC T2. What percentage of this theoretical peak is i) your fastest result ii) your slowest result? (You should find there is a big difference between the best and worse).
We now want to use hardware counters to study cache access characteristics. You can change the events to be calculated by calling hwevts_setEvt(index). Look at hwevts.c to find the indices which may help you determine L1 and L2 Cache miss/hit rates. At this point, you should realise that you need performance counts for a number of different events in order to come up with the miss/hit rates (hwevts_print() is an example of how to go about doing this).
  1. As such, state which performance counters you would use in order to determine the the level 1 data cache and level 2 or "ecache" hit and miss rate?
  2. Complete the following table:
    
    (Modify your shell/awk command from above)
        
          ------------------------------------------------------
                                      Level 1 D-Cache Reads
          nsize     nrpt         ref      misses        hits
          ------------------------------------------------------
          100	1000000 
          100 	1000000 
          100 	1000000 
          
          200	100000 
          200 	100000 
          200 	100000 
    
          400 	100000 
          400 	100000 
          400 	100000 
          
          800	100000 
          800 	100000 
          800 	100000 
          
          1000	10000 
          1000 	10000 
          1000 	10000 
          
          10000	1000 
          10000 	1000 
          10000 	1000 
    
          100000 	100 
          100000 	100 
          100000 	100 
          
          1000000	10 
          1000000 	10 
          1000000 	10 
          ------------------------------------------------------
    
          ------------------------------------------------------
                                      Level 2 D-Cache Access
          nsize     nrpt         ref      misses        hits
          ------------------------------------------------------
          100	1000000 
          100 	1000000 
          100 	1000000 
          
          1000	10000 
          1000 	10000 
          1000 	10000 
          
          10000	1000 
          10000 	1000 
          10000 	1000 
    
          100000 	100 
          100000 	100 
          100000 	100 
          
          1000000	10 
          1000000 	10 
          1000000 	10 
          ------------------------------------------------------
    
    

Cache Line Size

Objective - to show how careful arrangement of data can lead to much better cache usage and improved performance

Program align.c defines an array of the following struct:
    struct vector {
     int number;
     double x0, y0, z0;
     double x1, y1, z1;
     struct vector *previous;
    } ll[NUM];
it then proceeds to measure the cycles required to perform the following operation
    for ( i = 0; i < NUM; i++) {
      sum += fabs(ll[i].x0 - ll[i].x1);
    }
repeating this loop LOOPS times and printing out the minimum and maximum number of cycles used. Look at the code and make sure you uderstand what it is doing.

Compile align.c by typing make align. Run the code by typing align. There is no input.

  1. Run the code several times. (Use the repeat command you used above). What is the minimum number of cycles required?
  2. The code contains the line
        #pragma align 16 (ll)
        
    what does this line do? (Guess!) Perhaps change "16" to "8" and see what happens.
  3. Add code to printout the address of each of the following
           ll[k].number
           ll[k].x0  ll[k].y0  ll[k].z0
           ll[k].x1  ll[k].y1  ll[k].z1
           ll[k].previous
        
    for k=0,1,1023 and 1024. What are these addresses?
  4. Change the definition of vector to be
        struct vector {
         double x0, x1,
         double y0, y1;
         double z0, z1;
         int number; 
         struct vector *previous;
        } ll[NUM];
    
    what is the minimum cycle time now? Explain why this is different to the above results?

Cache Associativity

Objective - to see how cache line associativity can directly impact observed performance

Similar to align.c program assoc.c defines an array of struct:
    struct {
     double x1, x2, x3, x4;
    } ll[4*NUM];
and then proceeds to measure the cycles required to perform the following operation
 for (iter = 0; iter < NUM; iter+=(stride*5)) {
    int lim = iter + stride;
    for ( i = iter; i < lim; i++) {
      ll[i].x1 = ll[i].x1 + ll[i+1*stride].x1 + ll[i+2*stride].x1
                  + ll[i+3*stride].x1 + ll[i+4*stride].x1;
      ll[i].x2 = ll[i].x2 + ll[i+1*stride].x2 + ll[i+2*stride].x2
                  + ll[i+3*stride].x2 + ll[i+4*stride].x2;
      ll[i].x3 = ll[i].x3 + ll[i+1*stride].x3 + ll[i+2*stride].x3
                  + ll[i+3*stride].x3 + ll[i+4*stride].x3;
      ll[i].x4 = ll[i].x4 + ll[i+1*stride].x4 + ll[i+2*stride].x4
                  + ll[i+3*stride].x4 + ll[i+4*stride].x4;
    }
}
repeating this operation LOOPS times and printing out the minimum and maximum number of cycles used. The value of stride (corrected to be between 1 and NUM-1) is read as command line input.

Compile assoc.c by typing gmake assoc. Run the code by typing assoc stride .

  1. Complete the following table:
          ---------------------------------
                    Minimum Cycles/Instr
          Stride  Exp1    Exp2    Exp3
          ---------------------------------
             1
             2
    	 4
    	16
            32
            64
           128
           256
           512
          ---------------------------------
    
    
  2. You should find that the performance decreases markedly for the last 4 cases. Explain why this is the case?
  3. Perform additional measurements with stides close to 256, eg stride=253,254,255,257... etc. When exactly does the performance change markedly from good to bad or vice versa?
  4. You might have expected the above transition only to occur for stride=256 or multiples thereof - why? Why does it occur for 64 and 128?
  5. Recode this loop to remove this problem. (HINT you want to persuade the compiler to move all data from a cache line into registers before that cache line is evicted!)
If you have a little time...
  • Use performance counters in assoc.c to investigate level 2 cache activity and hopefully to confirm the reasons you have just given for the poor performance! Since the L2 cache is 4MB and 16 way set associative, some changes to assoc.c may also be necessary to yield interesting cache behaviour.

COMP6464 Only : Stream

Objective - investigate a widely used 3rd party benchmark for measuring memory bandwidth

Stream is a widely used memory bandwidth test. Go to the source code directory at the Stream web site and download the follow file:

   stream.c
  1. Read the header to the stream.c and try compiling the code accordingly. Now run the code on wallaman and partch, and complete the following table. What compiler options did you use on both machines?
                              MACHINE 1 - Wallaman
        -------------------------------------------------------------
        Function      Rate (MB/s)   RMS time     Min time     Max time
        Copy:      
        Scale:     
        Add:       
        Triad:     
                              MACHINE 2 - Partch
        -------------------------------------------------------------
        Function      Rate (MB/s)   RMS time     Min time     Max time
        Copy:      
        Scale:     
        Add:       
        Triad:     
    
  2. The Copy operation corresponds to:
           for (i = 0; i < LENGTH; i++) a[i] = b[i];
        
    write similar expressions for Scale, Add and Triad
  3. Given you knowledge of computer hardware would you expect Stream to report any difference in the MB/s rate for each of the above operations (Copy, Scale, Add, Triad)? Provide justification for you answer.
  4. Now recompile the code on Wallaman using the -fast option. Record new data (it should be better than before).
                              MACHINE 1 - Wallaman
        -------------------------------------------------------------
        Function      Rate (MB/s)   RMS time     Min time     Max time
        Copy:      
        Scale:     
        Add:       
        Triad:     
    
  5. Based on the MB/s results obtained using the -fast compiler flag compute MFLOP rates for the Add and Triad results assuming we are working with 8byte data quantities. Show how you compute these rates. How do these results compare with the theoretical peak?
  6. The stream performance of Wallaman is rather poor. How would you modify the stream benchmark in order to give a more reasonable assessment of Wallaman's memory bandwidth?
  7. From the web site work out the following
    • The Pentium IV system that gives the fastest single processor performance. Compare this with your results on partch.
    • What stream preformance might you expect for the APAC national facility SGI system
    • The machine that gives the best single processor results