ANU College of Engineering and Computer Science
Research School of Computer Science


High Performance Scientific Computing

COMP3320/6464 2012: Laboratory 2

Hardware Architecture and Code Performance

The aim of this lab is to understand how hardware architecture can dramatically affect the performance of your code.

ATTN: You must log on to xe.nci.org.au to complete this lab

The SGI XE cluster (xe.nci.org.au) is part of the national supercomputing infrastructure here in Canberra.

The machine xe.nci.org.au is a cluster whose nodes each contain two quad-core Intel Xeon E5462 processors. Each core has a 32 KB L1 data cache, and each processor has 2x6 MB of L2 cache (each pair of cores shares 6 MB). The CPUs run at 2.8 GHz. Every node has 16 GB of RAM. Nodes are connected by 20 Gb/s InfiniBand. More details about this machine can be found here.

Start by creating a directory for lab2 and copying the contents of the following tar file into that directory: lab2.tar.gz


Lab Assessment

You are not required to submit labs, but all lab material is examinable! You are therefore advised to record your answers to each question.

If you copied the files as above you will have a template answer file called:

COMP3320_LAB2.txt

Where relevant, write your answers in plain text in this file.

  • COMP3320_LAB2.txt: text answers to questions below


Access to xe.nci.org.au

Objective - get access to XE, setup your profile, set environment

First of all, we need to learn how to connect to XE, how to copy data onto this system, and how to load the appropriate software modules. This system has a few peculiarities and policies we have to be aware of.

First login to XE:

  1. Ask Jiri to give you a card with your XE login name.
  2. You are required to complete your user details. Open NCI account web page and use the Update Contacts and Researcher Track Record Details form found near the bottom.
    a) Log in using your XE login name
    b) Enter your full name in the first name field (you cannot change the surname field)
    c) Enter your email address
    d) Enter your phone number
    e) Ignore anything else
    Now you should receive an email from NCI confirming your profile update.
  3. Use ssh -X your_xe_login@xe.nci.org.au to log on to XE.
  4. Change your password immediately. Use passwd

Connect to your XE home directory:

XE has its own disk system. Your home directory is under /home/659/your_xe_login, where 659 is the ID of ANU CECS at NCI. You can mount your home directory by means of ssh or ftp.
  1. Open your Ubuntu main menu and select Connect to server...
    a) Server: xe.nci.org.au
    b) Type: ssh
    c) Folder: /home/659/your_xe_login
    d) Use your XE login and password
    e) You can add a bookmark in your system.
  2. Copy the files for this lab into this folder

XE software and quota:

When you log on to XE, you are provided with a basic environment. If you need to (and you will), you can load additional software modules. Related commands:
  1. module list shows all currently loaded modules.
  2. module load MODULE_NAME loads the named module.
  3. module unload MODULE_NAME unloads the named module.
You are restricted by the following quotas (type nf_limits to see them):
  1. Maximum walltime per application run is 24 minutes.
  2. Maximum RAM memory is 1 GB.
  3. All COMP3320 users share 100 CPU hours. Please use XE wisely!

Hardware Performance Counters

Objective - to demonstrate the use of hardware performance counters to gain very accurate performance information

The Intel Xeon processors are equipped with a small number of registers (4 to 8) that can count different performance events. These events relate to different subsystems: you can inspect, for example, the number of executed integer, floating-point, branch and memory instructions, as well as the number of accesses to the different cache memories.

There are several libraries providing access to these low-level metrics, such as PAPI. We are going to use this library to investigate the performance of simple codes.

  1. Transfer lab2.tar.gz to XE cluster.
  2. What machine (use hostname) are you running on? If it is not xe, log out and log on to xe.nci.org.au!
  3. Is the PAPI module loaded? If it is not, load the PAPI module.
PAPI provides useful command line tools (papi_avail, papi_cost, papi_mem_info, papi_clockres, papi_version, etc.):
  1. What version of PAPI did you load (papi_version)?

  2. What is the frequency of XE's CPUs (papi_avail)?
  3. How many events can you run simultaneously and multiplexed (papi_avail)?
  4. How many events are available on XE (papi_avail)?
  5. What is the name of the event that counts number of CPU clock cycles (papi_avail)?

  6. Write down the sizes of the L1 and L2 caches (papi_mem_info).
  7. For BOTH the L1 and L2 caches what are their i) sizes, ii) cache line sizes, iii) associativity?

  8. The overhead of PAPI can be measured with papi_cost. What is the overhead of i) starting PAPI, ii) reading a PAPI counter and iii) resetting a PAPI counter, in terms of clock cycles and seconds?
  9. Is the overhead of PAPI comparable with that of gettimeofday, which we used in the last lab?

Simplified PAPI C interface

PAPI also provides a C interface (papi.h). The documentation for the high-level C functions is here. We will use this library to measure the performance of three different loops, using first a simple PAPI interface and then an advanced one.

The simplified interface is composed of this PAPI function:

#include <papi.h>  
int PAPI_flops( float * rtime, float * ptime, long long * flpops, float * mflops );

The first call to PAPI_flops() initialises the PAPI high-level interface, sets up the counters to monitor the PAPI_FP_OPS and PAPI_TOT_CYC events, and starts the counters. Subsequent calls read the counters and return the total real time (walltime), the total process time, the total number of floating-point operations since the start of the measurement, and the Mflop/s rate since the latest call to PAPI_flops().

We will use this simplified interface to measure the performance of three different loops in terms of MFLOPS (millions of floating-point operations per second). The program works on two double arrays with nsize elements each, and the loop is repeated nrpt times.

Take a look at the following code and make sure you understand what it is doing.


  sum = 0.0;
  PAPI_flops(&real_time, &proc_time, &flpins, &mflops);

  for (int j = 0; j < nrpt; j++) {
    for (int k = 0; k < nsize; k++) {
      //-- loop one --//
      sum = sum + (a[k]*b[k]);

      //-- loop two --//
      //sum = sum + a[k] * (a[k]+b[k]);

      //-- loop three --//
      //if (a[k] > 0.5) sum = sum + (a[k]*b[k]);
      //else            sum = sum - (a[k]*b[k]);
    } // for k
  } // for j

  PAPI_flops(&real_time, &proc_time, &flpins, &mflops);

Now, you first have to load an appropriate compiler module: type module load gcc/4.4.4 and verify the module has been loaded successfully (type module list). Change into the directory gcount and build gcount.cpp by typing make. The program gcount takes two command line parameters, nsize and nrpt, where nsize is the length of the vectors a and b.

We are now going to use hardware performance counters to measure the performance of the i) sum = sum + a[k]*b[k] and ii) sum = sum + a[k]*(a[k]+b[k]) loops as a function of vector length, using a number of repetitions.
  1. Complete the following table.
       
          ---------------------------------------------------------------------
                                        Loop 1        Loop 2 
          nsize nrpt    Data size[B]    MFLOPS        MFLOPS    
          ---------------------------------------------------------------------
          100 10000000 
          1000 1000000 
          10000 100000       
          100000 10000       
          1000000 1000
          10000000 100
         
          ----------------------------------------------------------------------
    
  2. Why does loop 2 achieve better performance?
  3. In the above table you should find that with increasing nsize the performance initially improves, but then gets worse. Explain why this is.

We are now going to use hardware performance counters to measure the performance of loop 3. This loop inspects the value of each a[k] element and updates the final sum based on that value. The values of a were generated by a uniform random generator over the interval (0,1). We will change the threshold value from 0.5 to 0.999 and watch how this influences the performance.
You have to modify the value in the source code and rebuild it. Use nsize = 10000 and nrpt = 10000.

  1. Complete the following table.
      
          ----------------------------------------
                                   Loop 3    
          Threshold value          MFLOPS    
          ----------------------------------------
          0.5
          0.66
          0.75
          0.9
          0.95
          0.99
          0.999
          ----------------------------------------
  2. How would you explain these results?

High level PAPI C interface

Apart from the simplified interface, PAPI provides a comprehensive set of high-level API functions; see this page. This interface allows you to define a new event set composed of PAPI events (see the output of papi_avail).

An example of a high-level PAPI function is given in the directory gcount_pro. First, we need to define a new event set and an array to hold the values of the selected counters. Below, you can see an event set Events containing two events, PAPI_FP_OPS and PAPI_TOT_CYC. The second array, Values, will hold the values of these two counters.

    const int NUM_EVENTS = 2;
    int Events[NUM_EVENTS] = {PAPI_FP_OPS, PAPI_TOT_CYC};
    long long Values[NUM_EVENTS]; 

Now we have to call the following sequence of PAPI functions (details can be found here):

    1. PAPI_library_init(PAPI_VER_CURRENT)      - initialises the PAPI library.
    2. PAPI_start_counters(Events, NUM_EVENTS)  - takes the event set and starts the counters.
    3. RUN THE CRITICAL LOOP
    4. PAPI_stop_counters(Values, NUM_EVENTS)   - stops the counters and reads their values into the array Values.

Build gcount_pro by typing make. This version takes the same parameters as gcount. Now run gcount_pro and check the output. Make sure you understand the output.

  1. For one thread, the Intel Xeon processor can initiate two floating point operations (an addition and a multiplication) each clock cycle. Given this, how many cycles might you expect the above loop to take?
  2. If the line
       sum = sum + a[k]*b[k];
    were replaced by the line
       sum = sum + a[k]*(a[k]+b[k]);
    how many cycles might you expect the loop to take then?
  3. Complete the following table.
       
          ------------------------------------------------------------------------
                                          Loop 1             Loop 2 
          nsize nrpt    Data size[B]    instr/cycle        instr/cycle    
          ------------------------------------------------------------------------
          100 10000000 
          1000 1000000 
          10000 100000       
          100000 10000       
          1000000 1000
          10000000 100
         
          ----------------------------------------------------------------------
    
  4. Open the Makefile and change the compiler optimization level from -O3 to -O1, rebuild the code (type make clean and then make) and measure the first and the last row again. Is there any difference?
    Do not forget to put back -O3 when finished.

The last task for you is to investigate L1 and L2 data cache misses and compare them with the total number of L1 and L2 data accesses. For this task, you first need to decide what PAPI counters to use. Then, you have to modify the source code and rebuild it.

  1. Which performance counters would you use in order to determine the level 1 and level 2 data cache miss rates?
  2. Complete the following table. Work only with the first loop.
       
          -----------------------------------------------------------+-------------------------------                        
                                                L1 data cache        |          L2 data cache
          nsize nrpt    Data size[B]    misses   accesses  miss rate |   misses   accesses  miss rate
          -----------------------------------------------------------+-------------------------------
          100 10000000 
          1000 1000000 
          10000 100000       
          100000 10000       
          1000000 1000
          10000000 100     
          -----------------------------------------------------------------------------------------
    
  3. Do the L1 and L2 miss rates correspond to the MFLOPS drops in the previous tables?
  4. Open the Makefile and change the compiler optimization level from -O3 to -O1, rebuild the code (type make clean and then make) and measure the first and the last row again. Is there any difference?

COMP6464 Only : Stream

Objective - investigate a widely used third-party benchmark for measuring memory bandwidth

Stream is a widely used memory bandwidth benchmark. Go to the source code directory at the Stream web site and download the following file:

   stream.c
  1. Read the header of stream.c and try compiling the code accordingly. Now run the code on xe and partch, and complete the following table. What compiler options did you use on each machine?
                              MACHINE 1 - xe.nci.org.au
        -------------------------------------------------------------
        Function      Rate (MB/s)   RMS time     Min time     Max time
        Copy:      
        Scale:     
        Add:       
        Triad:     
                              MACHINE 2 - partch
        -------------------------------------------------------------
        Function      Rate (MB/s)   RMS time     Min time     Max time
        Copy:      
        Scale:     
        Add:       
        Triad:     
    
  2. The Copy operation corresponds to:
           for (i = 0; i < LENGTH; i++) a[i] = b[i];
        
    write similar expressions for Scale, Add and Triad
  3. Given your knowledge of computer hardware, would you expect Stream to report any difference in the MB/s rate for each of the above operations (Copy, Scale, Add, Triad)? Provide justification for your answer.