COMP3320/6464 2010: Laboratory 1
Performance Benchmarking
The aim of this lab is to give you some experience of timing
code, performance measurement, and benchmarking.
Unless explicitly stated otherwise, this lab uses the machine PARTCH.
First download lab1.tar.gz and unpack it to an appropriate directory. Then log in to partch as follows:-
ssh -X partch
Lab Assessment
As mentioned in the course assessment, labs are not marked, but their
content is examinable. I will place model answers on the lab web page
in a week or so.
If you copied the files as above you will have a template answer file
called:
COMP3320_LAB1.txt
Where relevant write answers in plain text in this file.
For the COMP6464 students there are a few additional
questions.
Timer Resolution: Elapsed and CPU Time
Objective - to measure the overhead and resolution of two elapsed timers
For benchmarking and performance measuring it is important to have
some knowledge of the resolution and overhead of your timing
routine. As mentioned in lectures we can obtain this information by
making repeated calls to the timer routine. Program
walltime.c does this for the C function
gettimeofday.
Inspect the code and make sure you understand what it is doing. Then
compile the code by issuing the command
make walltime
- What is your estimate for the resolution (Tr) and
overhead (Tc) of gettimeofday on Partch?
(You may need to run the code a few times.)
- Now repeat the above measurements but on your desktop machine.
You will need to compile the code using gcc.
- What modifications to walltime.c must be made in order to
estimate the resolution and overhead of the process time? Can we
simply replace gettimeofday with something else?
CPU timing and Scaling Properties
Objective - to measure CPU time and appreciate computational scaling
Program linear.c sets up a system of linear equations and
solves them using Gaussian elimination. That is, we are solving a set
of equations of the form

Ax = b

where A is a square matrix of dimension n and b is a vector of
length n. The elements of both A and b are known, and the aim is to
determine the elements of x. This sort of problem is very common in
computational science, and forms the basis of the metric used to rank
the Top500 most powerful supercomputers. For the purpose of this lab
it is not important for you to understand the details of Gaussian
elimination, just to appreciate that the cost of solving the problem
depends in some manner on the input dimension n.
On Partch compile linear.c by typing make linear. Run the code
by giving the problem size as a command line argument. Start with a
small value of n, say n=10
linear 10
Run the code for a few different input dimensions, e.g. 10, 100, 200.
You will see that the code reports the CPU, SYS and CPU+SYS time.
- What units are the CPU, SYS and CPU+SYS time reported in?
(You should make an educated guess based on the sort of time the
program takes to run - e.g. does it seem to take hours to run?
Then look at the code and read the relevant man pages.)
- Briefly state what you understand to be the meaning of CPU time
and SYS time.
- We can time programs using the Unix time utility (do
man -s1 time for details). Use this utility and complete
the following table comparing the results from time with
the CPU+SYS time. Run 3 experiments for each data size,
or more if your results are inconsistent between runs. (You might
want to consider doing something like "repeat 7 time linear 128 |
egrep '(CPU\+SYS|%)' ")
-------------------------------------------------------------------
                       Input Dimension
               80   80   80  160  160  160  320  320  320
-------------------------------------------------------------------
via code
  CPU+SYS
via time
  real
  user
  sys
-------------------------------------------------------------------
                       Input Dimension
              640  640  640 1280 1280 1280
-------------------------------------------------------------------
via code
  CPU+SYS
via time
  real
  user
  sys
-------------------------------------------------------------------
- Why are the results from CPU+SYS different from the
time results? (You may need to inspect the code to confirm
your thoughts.)
- Using the above data, plus perhaps additional timing runs, deduce
how the computational cost of linear varies with the input
dimension n. You must include experimental justification
for your answer. (We are looking for an expression of the form
O(n^k) where k is an integer.)
Code Profiling
Objective - to understand how to profile code and analyze the results
Program
linear.c contains the following routines:
void main
void iprtmatrix
void prtmatrix
void gauss_elim
void largest
void elim
void pivot
void subst
double cpu_time
In addition the executable will make use of some system library
routines; e.g. drand48 is used to provide pseudo-random numbers. We
now wish to determine which routine(s) are consuming most of the CPU
time. We will do this using
gprof
Compile linear.c with profiling enabled. This is done using
the -pg compiler option. The Makefile has been
modified to easily do this by typing
make linear_gprof
Now run the profile-enabled version of the code, e.g.
linear_gprof 400
If you look in your directory you will find that a file
gmon.out has been created. Analyze this by typing
gprof linear_gprof
It will produce a lot of output that you may want to redirect into a
file, e.g. to put the output in a file called
file.profile:
gprof linear_gprof >& file.profile
Alternatively you can limit the size of the profile via the
-b option to
gprof, which disables the printing of the blurb:
gprof -b linear_gprof
Profile the code for
n=800
- Which routine is the dominant routine, and what fraction of the
total CPU time is it taking?
- How many times are the routines elim, pivot and
largest called?
- Use the profile
output to construct a calling graph for those routines contained
in linear.c, i.e. show how each routine is
called and approximately how much time is spent in a given routine
when it is called from the parent routine. You should
verify your calling graph by inspecting the code.
We will now profile the code using basic block profiling with
gcov. This will tell us how many times different lines of
code are actually executed. To profile the code in this manner we
must compile with the option -fprofile-arcs -ftest-coverage. The
Makefile has been conveniently modified to do this, type:-
make linear_gcov
Now run the gcov-enabled version of the code, e.g.
linear_gcov 400
If you look in your directory you will find the files
linear.gcda and linear.gcno; these files contain
information needed by gcov. To obtain the coverage information we
must provide gcov with the name of the source code file
gcov linear.c
(do
man gcov for more information). This produces an output
file
linear.c.gcov.
Use gcov profiling for n=400
- Which line of the code is executed the most times? (Quote the
source line and report the number of times it is executed.)
- Is the location of this line consistent with your gprof profiles?
- How does the number of times this line is executed change as you
vary the problem size? (NOTE: between gcov runs you should remove
any .gcda data files that relate to the previous run - otherwise your
results will be a combination of all previous gcov runs.)
Is this consistent with your earlier
conclusions concerning the overall scaling of the code? Is it also
consistent with the loop structure around this line? (You may want
to run gcov profiling for a variety of problem sizes.)
- What other lines in the code have significant execution counts and
how do they scale with the input problem size?
- How many basic blocks does the code contain, and how many of
these were actually executed?
Intel VTune Amplifier XE 2011 Demonstration
Objective - to introduce an advanced profiling tool
Intel VTune Amplifier XE is a powerful threading and performance optimization tool for C/C++, .NET, and Fortran developers who need to understand an application's serial and parallel behavior in order to improve performance and scalability.
You can download a 30-day evaluation version from the Intel website (900MB) and run it on your own machine. It supports Windows, Linux, and Mac OS X.
For system-wide statistics you need a super-user account!
There are a few demo videos on the Intel website.
For COMP6464/Honours Students (and others if desired) : Machine
Information
Objective - find information out about the hardware we are using!
In lectures we have discussed RISC hardware. Often exactly which
algorithm is used in a high performance computing application will
depend on the exact configuration of the platform being used. For both
partch and your desktop machine, answer the following:
- What is the name of the machine you are considering?
- What operating system, and what version of it, is running? How did
you tell?
- What speed processor (MHz) is on this machine,
and how did you determine this information?
- How much memory is on this machine, and how did you determine this?
- How many CPUs does it have, and how did you determine this?
- What is the size of the level 1 data cache and the level 2 cache (also
known as ecache or external cache), and how did you gain this information?
You will find the following useful starting points:-
uname, dmesg, fpversion, cpuinfo, meminfo and prtconf.