ANU College of Engineering and Computer Science
Research School of Computer Science

High Performance Scientific Computing

COMP3320/6464 2010: Laboratory 1

Performance Benchmarking

The aim of this lab is to give you some experience with timing code, performance measurement and benchmarking.

Unless explicitly stated otherwise, this lab uses the machine partch.


First download lab1.tar.gz, unpacking it to an appropriate directory. Then log in to partch as follows:-

  ssh -X partch


Lab Assessment

As mentioned in the course assessment, labs are not marked, but the content is examinable. I will place model answers on the lab web page in a week or so's time.

If you copied the files as above you will have a template answer file called:

COMP3320_LAB1.txt

Where relevant write answers in plain text in this file. For the COMP6464 students there are a few additional questions.


Timer Resolution: Elapsed and CPU Time

Objective - to measure the overhead and resolution of two elapsed timers

For benchmarking and performance measurement it is important to have some knowledge of the resolution and overhead of your timing routine. As mentioned in lectures, we can obtain this information by making repeated calls to the timer routine. Program walltime.c does this for the C function gettimeofday. Inspect the code and make sure you understand what it is doing. Then compile the code by issuing the command make walltime.
  1. What is your estimate for the resolution (Tr) and overhead (Tc) of gettimeofday on Partch? (You may need to run the code a few times.)
  2. Now repeat the above measurements but on your desktop machine. You will need to compile the code using gcc.
  3. What modifications to walltime.c must be made in order to estimate the resolution and overhead of the process time? Can we simply replace gettimeofday with something else?

CPU Timing and Scaling Properties

Objective - to measure CPU time and appreciate computational scaling

Program linear.c sets up a system of linear equations and solves them using Gaussian elimination. That is, we are solving a set of equations of the form
Ax = b
where A is a square matrix of dimension n and b is a vector of length n. The elements of both A and b are known, and the aim is to determine the elements of x. This sort of problem is very common in computational science, and forms the basis of the metric used to rank the Top500 list of the world's most powerful supercomputers. For the purpose of this lab it is not important for you to understand the details of Gaussian elimination, just to appreciate that the cost of solving the problem depends in some manner on the input dimension n.

On Partch compile linear.c by typing make linear. Run the code by giving the problem size as a command line argument. Start with a small value of n, say n=10

    linear 10
Run the code for a few different input dimensions, e.g. 10, 100, 200. You will see that the code reports the CPU, SYS and CPU+SYS time.
  1. What units are the CPU, SYS and CPU+SYS times reported in? (You should make an educated guess based on the sort of time the program takes to run - e.g. does it seem to take hours to run? Then look at the code and read the relevant man pages.)
  2. Briefly state what you understand to be the meaning of CPU time and SYS time.
  3. We can time programs using the Unix time utility (do man -s1 time for details). Use this utility and complete the following table comparing the results from time with the CPU+SYS time. Run 3 experiments for each data size, or more if your results are inconsistent between runs. (You might want to consider doing something like "repeat 7 time linear 128 | egrep '(CPU\+SYS|%)' ")
         -------------------------------------------------------------------
                              --------------Input Dimension----------
                          80    80    80   160   160   160   320   320   320
         -------------------------------------------------------------------
         via code
             CPU+SYS
         via time 
                real
                user
                 sys
         -------------------------------------------------------------------
                              --------------Input Dimension----------
                         640   640   640   1280 1280 1280
         -------------------------------------------------------------------
         via code
             CPU+SYS
         via time 
                real
                user
                 sys
         -------------------------------------------------------------------
    
  4. Why are the results for CPU+SYS different from the time results? (You may need to inspect the code to confirm your thoughts.)
  5. Using the above data, plus perhaps additional timing runs, deduce how the computational cost of linear varies with the input dimension n. You must include experimental justification for your answer. (We are looking for an expression of the form O(n^k) where k is an integer.)

Code Profiling

Objective - to understand how to profile code and analyze the results

Program linear.c contains the following routines:
  void   main
  void   iprtmatrix
  void   prtmatrix
  void   gauss_elim
  void   largest
  void   elim
  void   pivot
  void   subst
  double cpu_time
In addition the executable will make use of some system libraries, e.g. drand48 is used to provide pseudo-random numbers. We now wish to determine which routine(s) are consuming most of the CPU time. We will do this using gprof.

Compile linear.c with profiling enabled. This is done using the -pg compiler option. The Makefile has been modified to easily do this by typing

   make linear_gprof
Now run the profile enabled version of the code, e.g.
   linear_gprof 400
If you look in your directory you will find that a file gmon.out has been created. Analyze this by typing
   gprof linear_gprof
It will produce a lot of output that you may want to redirect into a file (e.g. to put the output in a file called file.profile):
   gprof linear_gprof >& file.profile 
Alternatively you can limit the size of the profile via the -b option to gprof, which disables the printing of the blurb:
   gprof -b linear_gprof 
Profile the code for n=800
  1. Which routine is the dominant routine and what fraction of the total CPU is it taking?
  2. How many times are the routines elim, pivot and largest called?
  3. Use the profile output to construct a calling graph for those routines contained in linear.c, i.e. show how each routine is called and approximately how much time is spent in a given routine when it is called from the parent routine. You should verify your calling graph by inspecting the code.

We will now profile the code using basic block profiling with gcov. This will tell us how many times different lines of code are actually executed. To profile the code in this manner we must compile with the option -fprofile-arcs -ftest-coverage. The Makefile has been conveniently modified to do this, type:-

   make linear_gcov
Now run the gcov enabled version of the code, e.g.
   linear_gcov 400
If you look in your directory you will find the files linear.gcda and linear.gcno; these files contain information needed by gcov. To obtain the coverage information we must provide gcov with the name of the source code file:
   gcov linear.c
(do man gcov for more information). This produces an output file linear.c.gcov.

Use gcov profiling for n=400

  1. Which line of the code is executed the most times? (Cut and paste that line of source code and report the number of times it is executed.)
  2. Is the location of this line consistent with your gprof profiles?
  3. How does the number of times this line is executed change as you vary the problem size? (NOTE: between gcov runs you should remove any .gcda data files that relate to the previous run - else your results will be a combination of all previous gcov runs.) Is this consistent with your earlier conclusions concerning the overall scaling of the code? Is it also consistent with the loop structure around this line? (You may want to run gcov profiling for a variety of problem sizes.)
  4. What other lines in the code have significant execution counts and how do they scale with the input problem size?
  5. How many basic blocks does the code contain, and how many of these were actually executed?


Intel VTune AmplifierXE 2011 Demonstration

Objective - to introduce an advanced profiling tool

Intel® VTune™ Amplifier XE is a powerful threading and performance optimization tool for C/C++, .NET, and Fortran developers who need to understand an application's serial and parallel behavior to improve performance and scalability.

You can download a 30-day evaluation version from Intel's website (900MB) and run it on your own machine. It supports Windows, Linux, and Mac OS X.
For system wide statistics you need a super-user account!

There are a few demo videos on the Intel website.


For COMP6464/Honours Students (and others if desired): Machine Information

Objective - find information out about the hardware we are using!

In lectures we have discussed RISC hardware. Often exactly which algorithm is used in a high performance computing application will depend on the exact configuration of the platform being used. For both partch and your desktop machine:
  1. What is the name of the machine you are considering?
  2. What operating system, and what version of it, is running? How did you tell?
  3. What speed processor (MHz) is on this machine, and how did you determine this information?
  4. How much memory is on this machine and how did you determine this?
  5. How many CPUs does it have and how did you determine this?
  6. What is the size of the level 1 data cache and the level 2 cache (also known as ecache or external cache), and how did you gain this information?
You will find the following to be useful starting points: uname, dmesg, fpversion, cpuinfo, meminfo and prtconf!