Skip Navigation | ANU Home | Search ANU | Search FEIT | Feedback
The Australian National University
Faculty of Engineering and Information Technology (FEIT)
Department of Computer Science
Printer Friendly Version of this Document
High Performance Scientific Computing COMP3320

COMP3320/6464 2008: Laboratory 1

Performance Benchmarking

The aim of this lab is to give you some experience of timing code, performance measurement and benchmarking

Unless explicitly stated this lab will be using machine IWAKI


First login to Iwaki. Then creat a directory and copying all the relevant programs into that directory:

  ssh iwaki
  mkdir lab1
  cd lab1
  cp -r /dept/dcs/comp3320/public/lab1/* .


Lab Assessment

As mentioned in the course assessment, labs are not marked, but the content is examinable. I will place model answers on the lab web page in a week or so time.

If you copied the files as above you will have a template answer file called:

COMP3320_LAB1.txt

Where relevant write answers in plain text in this file. For the COMP6464 students there are a few additional questions.


Timer Resolution: Elapsed Time

Objective - to measure the overhead and resolution of two elapsed timers

For benchmarking and performance measuring it is important to have some knowledge of the resolution and overhead of your timing routine. As mentioned in lectures we can obtain this information by making repeated calls to the timer routine. Program walltime.c does this for the C function gettimeofday. Inspect the code and make sure you understand what it is doing. Then compile the code by issuing the command make walltime
  1. What is your estimate for the resolution (Tr) and overhead (Tc) of gettimeofday on Iwaki. (You may need to run the code a few times).
  2. Now repeat the above measurements but on your desktop machine. You will need to compile the code using gcc.
On Iwkai another routine measuring elapsed time is available in the library libunknown.a. It is called elapsetime and has the following prototype:
        void elapsetime(int* isec, int* iusec)
where argument isec returns the current second component of the elapsed time while argument iusec returns the current microsecond (10-6) component (similar to gettimeofday).
  1. On Iwaki copy walltime.c to mywalltime.c and modify it to determine the overhead and resolution of elapsetime. What are the values of (Tr) and overhead (Tc) for this routine? (Note the makefile will compile and link your file mywalltime.c with libunknown.a correctly).
  2. Keep a copy of your code as mywalltime.c.

CPU timing and Scaling Properties

Objective - to measure CPU time and appreciate computational scaling

Program linear.c sets up a system of linear equations and solves them using Gaussian elimination. That is we are solving a set of equations of the form
Ax = b
where A is a square matrix of dimension n and b is a vector of length n. The elements of both A and b are known, and the aim is to determine the elements of x. This sort of problem is very common in computational science, and forms the basis of the metric that is used to rank the top500 most powerful supercomputers. For the purpose of this lab it is not important for you to understand the details of Gaussian elimination, just to appreciate that the problem to be solved depends in some manner on the input dimension n.

On Iwaki compile linear.c by typing make linear. Run the code by giving the problem size as a command line argument. Start with a small value of n, say n=10

    linear 10
Run the code for a few different input dimensions, e.g. 10, 100, 200. You will see that the code reports the CPU, SYS and CPU+SYS time.
  1. What units are the CPU, SYS and CPU+SYS time reported in? (You should make an educated guess based on the sort of time the program takes to run - eg does it seem to take hours to run! Then look at the code and read the relevant man pages).
  2. Briefly state what you understand to be the meaning of CPU time and SYS time.
  3. We can time programs using the Unix time utility (do man -s1 time for details). Use this utility and complete the following table comparing the results from time with the CPU+SYS time. Run 3 experiments for each data size, or more if your results are inconsistent between runs. (You might want to consider doing something like "repeat 7 time linear 128 | egrep '(CPU\+SYS|%)' ")
         -------------------------------------------------------------------
                              --------------Input Dimension----------
                          80    80    80   160   160   160   320   320   320
         -------------------------------------------------------------------
         via code
             CPU+SYS
         via time 
                real
                user
                 sys
         -------------------------------------------------------------------
                              --------------Input Dimension----------
                         640   640   640   1280 1280 1280
         -------------------------------------------------------------------
         via code
             CPU+SYS
         via time 
                real
                user
                 sys
         -------------------------------------------------------------------
    
  4. Why are the results from CPU+SYS different from the time results (you may need to inspect the code to confirm your thoughts).
  5. Using the above data, plus perhaps additional timing runs, deduce how the computational cost of linear varies with the input dimension n. You must include experimental justification for your answer. (We are looking for an expression of the form O(nk) where k is an integer).

Code Profiling

Objective - to understand how to profile code and analyze the results

Program linear.c contains the following routines:
  void   main
  void   iprtmatrix
  void   prtmatrix
  void   gauss_elim
  void   largest
  void   elim
  void   pivot
  void   subst
  double cpu_time
In addition the executable will make use of some system libraries, e.g. drand48 is used to provide pseudo random numbers. We wish now to determine what routine(s) are consuming most of the CPU time. We will do this using gprof

Compile linear.c with profiling enabled. This is done using the -xpg compiler option. The Makefile has been modified to easily do this by typing

   make linear_prof
Now run the profile enabled version of the code, e.g.
   linear_prof 400
if you look in your directory you will find that a file gmon.out has been created. Analyze this by typing
   gprof linear_prof
It will produce a lot of output that you may want to redirect into a file (e.g. to put output in a file called file.profile
   gprof linear_prof >& file.profile 
Alternatively you can limit the size of the profile via the -n option to gprof:
   gprof -n10 linear_prof 
Profile the code for n=800
  1. Which routine is the dominant routine and what fraction of the total CPU is it taking?
  2. How many times are the routines elim, pivot and largest called?
  3. Use the profile output to construct a calling graph for those routines contained in linear.c, i.e. show how each routine is called and approximately how much time is spent in a given routine when it is called from the parent routine. You should verify your calling graph by inspecting the code.

We will now profile the code using basic block profiling with tcov. This will tell us how many times different lines of code are actually executed. To profile the code in this manner we must compile with the option xprofile=tcov. The Makefile has been conveniently modified to do this, type:-

   make linear_tcov
Now run the tcov enabled version of the code, e.g.
   linear_tcov 400
if you look in your directory you will find that you now have a directory linear_tcov.profile. To obtain the coverage information we must provide tcov with the name of this directory and the relevant source code file, e.g.
   tcov -x linear_tcov.profile linear.c
(do man tcov for more information). This produces an output file linear.c.tcov.

Use tcov profiling for n=400

  1. Which line of the code is executed the most times (cut out the source code and report the number of times this line is executed).
  2. Is the location of this line consistent with your gprof profiles?
  3. How does the number of times this line is executed change as you vary the problem size? (NOTE, between tcov runs you should remove any tcov data files that relate to the previous run - else your results will be a combination of all previous tcov runs). Is this consistent with your earlier conclusions concerning the overall scaling of the code. Is it also consistent with the loop structure around this line? (You may want to run tcov profiling for a variety of problem sizes).
  4. What other lines in the code have significant execution counts and how do they scale with the input problem size?
  5. How many basic blocks does the code contain, and how many of these were actually executed?
  6. tcov is a Sun/Solaris utility. How do you perform equivalent profiling on linux with gcc compilers?

Finally, Sun provides a rather fancy profiling tool called collector and analyzer. This will provide profiling similar to gprof but without having to recompile the code. Unfortunately, the graphical interface is not installed on the student system - but using another machine I will demonstrate this to you in labs. We can, however, still use this profiler via its command line interface. As a simple demonstration type

   collect linear 400
this will create a profile directory called test.1.er or similar (numbers increment each time). To analyze the profile you can type
   er_print -function test.1.er
this will print the top routines and the time spent in each.
  1. Read the man page for er_print. How would you use er_print to obtain details on which routine called which, ie a calling graph similar to that from gprof? (Try this to make sure that it works).

Analyzer Demo

At this point I will show you use of the analyzer tool. This is a GUI to view the results of using collect. It is on a different machine.

For COMP6464/Honours Students (and others if desired) : Machine Information

Objective - find information out about the hardware we are using!

In lectures we have discussed RISC hardware. Often exactly which algorithm is used in a high performance computing application will depend on the exact configuration of the platform being used. For both iwaki and partch
  1. What is the name of the machine you are considering?
  2. What Operating system and what version of it is running. How did you tell?
  3. What speed processor (MHZ) is on this machine and how did you determine this information?
  4. How much memory is on this machine and how did you determine this?
  5. How many CPUs does it have and how did you determine this?
  6. What is size of the level 1 data cache and level 2 cache (also known as ecache or external cache) and how did you gain this information?
You will find the following:- uname, dmesg, fpversion, cpuinfo, meminfo and prtconf to be useful starting points!