COMP3320/6464 2010: Laboratory 1
Performance Benchmarking
The aim of this lab is to give you some experience of timing
code, performance measurement and benchmarking
Unless explicitly stated this lab will be using machine PARTCH
First download lab1.tar.gz, unpacking it to an appropriate directory. Then login to partch as follows:-
ssh -X partch
Lab Assessment
As mentioned in the course assessment, labs are not marked, but the
content is examinable. I will place model answers on the lab web page
in a week or so time.
If you copied the files as above you will have a template answer file
called:
COMP3320_LAB1.txt
Where relevant write answers in plain text in this file.
For the COMP6464 students there are a few additional
questions.
Timer Resolution: Elapsed and CPU Time
Objective - to measure the overhead and resolution of two elapsed timers
For benchmarking and performance measuring it is important to have
some knowledge of the resolution and overhead of your timing
routine. As mentioned in lectures we can obtain this information by
making repeated calls to the timer routine. Program
walltime.c does this for the C function
gettimeofday.
Inspect the code and make sure you understand what it is doing. Then
compile the code by issuing the command
make walltime
- What is your estimate for the resolution (Tr) and
overhead (Tc) of gettimeofday on Partch.
(You may need to run the code a few times).
- Now repeat the above measurements but on your desktop machine.
You will need to compile the code using gcc.
- What modifications to walltime.c must be made in order to
estimate the resolution and overhead of the process time? Can we
simply replace gettimeofday with something else?
CPU timing and Scaling Properties
Objective - to measure CPU time and appreciate computational scaling
Program
linear.c sets up a system of linear equations and
solves them using Gaussian elimination. That is we are solving a set
of equations of the form
Ax = b
where
A is a square matrix of dimension
n and
b is a vector of length
n. The elements of both
A and
b are known, and the aim is to determine
the elements of
x. This sort of problem is very common in
computational science, and forms the basis
of the metric that is used to rank the
top500 most powerful
supercomputers. For the purpose of this lab it is not
important for you to understand the details of Gaussian elimination,
just to appreciate that the problem to be solved depends in some manner
on the input dimension
n.
On Partch compile linear.c by typing make linear. Run the code
by giving the problem size as a command line argument. Start with a
small value of n, say n=10
linear 10
Run the code for a few different input dimensions, e.g. 10, 100, 200.
You will see that the code reports the CPU, SYS and CPU+SYS time.
- What units are the CPU, SYS and CPU+SYS time reported in?
(You should make an educated guess based on the sort of time the
program takes to run - eg does it seem to take hours to run!
Then look at the code and read the relevant man pages).
- Briefly state what you understand to be the meaning of CPU time
and SYS time.
- We can time programs using the Unix time utility (do
man -s1 time for details). Use this utility and complete
the following table comparing the results from time with
the CPU+SYS time. Run 3 experiments for each data size,
or more if your results are inconsistent between runs. (You might
want to consider doing something like "repeat 7 time linear 128 |
egrep '(CPU\+SYS|%)' ")
-------------------------------------------------------------------
--------------Input Dimension----------
80 80 80 160 160 160 320 320 320
-------------------------------------------------------------------
via code
CPU+SYS
via time
real
user
sys
-------------------------------------------------------------------
--------------Input Dimension----------
640 640 640 1280 1280 1280
-------------------------------------------------------------------
via code
CPU+SYS
via time
real
user
sys
-------------------------------------------------------------------
- Why are the results from CPU+SYS different from the
time results (you may need to inspect the code to confirm
your thoughts).
- Using the above data, plus perhaps additional timing runs, deduce
how the computational cost of linear varies with the input
dimension n. You must include experimental justification
for your answer. (We are looking for an expression of the form
O(nk) where k is an integer).
Code Profiling
Objective - to understand how to profile code and analyze the results
Program
linear.c contains the following routines:
void main
void iprtmatrix
void prtmatrix
void gauss_elim
void largest
void elim
void pivot
void subst
double cpu_time
In addition the executable will make use of some system libraries,
e.g.
drand48 is used to provide pseudo random numbers. We
wish now to determine what routine(s) are consuming most of the CPU
time. We will do this using
gprof
Compile linear.c with profiling enabled. This is done using
the -pg compiler option. The Makefile has been
modified to easily do this by typing
make linear_gprof
Now run the profile enabled version of the code, e.g.
linear_gprof 400
if you look in your directory you will find that a file
gmon.out has been created. Analyze this by typing
gprof linear_gprof
It will produce a lot of output that you may want to redirect into a
file (e.g. to put output in a file called
file.profile
gprof linear_gprof >& file.profile
Alternatively you can limit the size of the profile via the
-b option to
gprof, which disables the printing of the blurb:
gprof -b linear_gprof
Profile the code for
n=800
- Which routine is the dominant routine and what fraction of the
total CPU is it taking?
- How many times are the routines elim, pivot and
largest called?
- Use the profile
output to construct a calling graph for those routines contained
in linear.c, i.e. show how each routine is
called and approximately how much time is spent in a given routine
when it is called from the parent routine. You should
verify your calling graph by inspecting the code.
We will now profile the code using basic block profiling with
gcov. This will tell us how many times different lines of
code are actually executed. To profile the code in this manner we
must compile with the option -fprofile-arcs -ftest-coverage. The
Makefile has been conveniently modified to do this, type:-
make linear_gcov
Now run the gcov enabled version of the code, e.g.
linear_gcov 400
if you look in your directory you will find the
files
linear.gcda and
linear.gcno, these files contain
information needed by gcov. To obtain the coverage information we
must provide gcov with the name of the source code file
gcov linear.c
(do
man gcov for more information). This produces an output
file
linear.c.gcov.
Use gcov profiling for n=400
- Which line of the code is executed the most times (cut out the
source code and report the number of times this line is executed).
- Is the location of this line consistent with your gprof profiles?
- How does the number of times this line is executed change as you
vary the problem size? (NOTE, between gcov runs you should remove
any .gcda data files that relate to the previous run - else your
results will be a combination of all previous gcov runs).
Is this consistent with your earlier
conclusions concerning the overall scaling of the code. Is it also
consistent with the loop structure around this line? (You may want
to run gcov profiling for a variety of problem sizes).
- What other lines in the code have significant execution counts and
how do they scale with the input problem size?
- How many basic blocks does the code contain, and how many of
these were actually executed?
Profiling using Collecter and Analyzer
Objective - To profile code using more advanced GUI based tools
Finally, Sun provides a rather fancy GUI profiling tool called collector
and analyzer. This will provide profiling similar to gprof but without
having to recompile the code. Since partch is not a Sun machine, to
demonstrate collector/analyzer we will instead use a Sparc
machine,
wallaman. That is in
ssh -X wallaman. You
will need to use
gmake instead of
make in order to use
the existing Makefile.
To collect performance data you will notice that the -g is
used in the Makefile. To build the executable, type:-
/usr/sfw/bin/gmake linear_sun
Then to collect the profiling data for linear for a problem size of 1,000, type
collect ./linear_sun 1000
this will create a profile directory called
test.1.er or
similar (numbers increment each time). By default, clock-based profiling (similar to gprof) will be performed. For more advanced profiling options involving performance counters, heaps, OpenMP, MPI, etc, consult the man page for
collector. Another option is to use the GUI interface to collect the
profiling data instead of using the command line.
To analyze the profile on the command line you can type
er_print -function test.1.er
this will print the top routines and the time spent in each.
To analyze the profile on the GUI (which we suggest you use when
possible) you can type
analyzer
the program will ask you to open an experiment. The experiment in this
case is the directory containing the profile, ie
test.1.er
etc. The GUI allows you to view different aspects of the profile right
down to the instruction level.
- The functions tab shows the most
time consuming functions in the code.
- The sources and disassembly tabs highlights the most time consuming lines
of code, and assembly instructions,
respectively.
- The callers-callees shows the call graph of the
function.
- Finally, the timeline tab shows the profiling envents
in chronological order.
Be aware that the data presented in each of these tabs vary depending on
the profile. In this case, we are doing clock-based profiling which
essentially measures the time.
- Take some time to explore the features of analyzer. Contrast it with your experiences with gprof and gcov.
- Create a new experiment which profiles the number of data cache
miss events (D$ Misses) during the execution of linear_sun.
Consult the following link for documentation on collect/analyzer:
http://developers.sun.com/sunstudio/overview/topics/analyzing.jsp
For COMP6464/Honours Students (and others if desired) : Machine
Information
Objective - find information out about the hardware we are using!
In lectures we have discussed RISC hardware. Often exactly which
algorithm is used in a high performance computing application will
depend on the exact configuration of the platform being used. For both
partch
- What is the name of the machine you are considering?
- What Operating system and what version of it is running. How did
you tell?
- What speed processor (MHZ) is on this machine
and how did you determine this information?
- How much memory is on this machine and how did you determine this?
- How many CPUs does it have and how did you determine this?
- What is size of the level 1 data cache and level 2 cache (also
known as ecache or external cache) and how did you gain this information?
You will find the following:-
uname,
dmesg,
fpversion,
cpuinfo,
meminfo and
prtconf to be useful starting points!