COMP3320/6464 2008: Laboratory 2
Hardware Architecture and Code Performance
The aim of this lab is to understand how hardware architecture can
dramatically affect the performance of your code.
YOU MUST LOGON TO IWAKI TO DO THIS LAB
Start by creating a directory and copying all the relevant programs
into that directory:
mkdir lab2
cd lab2
cp -r /dept/dcs/comp3320/public/lab2/* .
Lab Assessment
You are not required to submit labs, but all lab material is
examinable! You are therefore advised to record you answers to each
question.
If you copied the files as above you will have a template answer file
called:
COMP3320_LAB2.txt
Where relevant write answers in plain text in this file.
- COMP3320_LAB2.txt: text answers to questions below
Hardware Performance Counters
Objective - to demonstrate the use of hardware performance counters to
gain very accurate performance information
As discussed in lectures, the UltraSPARC CPU contain two hardware
performance counter registers -
pic0 and
pic1. Exactly what events can be counted differs between
UltraSPARC hardware. To find out what can be counted on Iwaki do
"cpustat -h"
- What machine (use hostname) are your running on? If it
is not iwaki log out and log on to iwaki!
- What is its clock rate (use /usr/local/bin/fpversion)?
As well as giving the processor speed,
fpversion also
gives the cache characteristics of
the machine you are using in the form
-xcache=16/32/1:1024/64/1. To interpret this do
man
cc and look for
xcache.
- For BOTH the L1 and L2
caches what are their i) sizes, ii) cache line sizes, iii)
associativity.
Program
pcount repeatedly executes the following for loop
/* Begin Critical Loop */
if (cpc_take_sample(&before) == -1) exit(-1);
for (k = 0; k < nsize-1; k++){
sum = sum + a[k]*b[k];
}
if (cpc_take_sample(&after) == -1) exit(-1);
/* End Critical Loop */
interrogating the hardware performance counters immediately before and
after the for loop. Look at the code and make sure you understand what
it is doing.
Compile pcount.c by typing make pcount. Run the code
by typing pcount nsize nrpt where nsize and
nrpt are explained below.
- The UltraSPARC III processor can initiate one floating point
multiplication and one floating point addition each clock cycle.
Given this how many cycles might you expect the above loop to take?
- If the line
sum = sum + a[k]*b[k];
were replaced by the line
sum = sum + a[k]*(a[k]+b[k]);
how many cycles would now be required to complete 1 iteration?
We are now going to use hardware performance counters to measure
the performance of the
sum=sum+a[k]*b[k] loop as a function
of vector length. Program
pcount takes as command line input
two parameters,
nsize and
nrpt.
nsize is
the length of the vectors
a and
b.
nrpt is
the number of times we will repeat the whole timing exercise. (Look at
code to ensure you understand this).
- Complete the following table.
To gather data quickly you may find this command useful!
iwaki> repeat 3 pcount 10 1000000 | grep Meas |
nawk '{if (NR%2 == 1){xx=$3}else{print xx " " xx/10000000 " " $3/xx}}'
...assuming you know awk!
------------------------------------------------------
nsize nrpt Cycles Cycle/iter Inst/cycle
------------------------------------------------------
10 1000000
10 1000000
10 1000000
100 100000
100 100000
100 100000
1000 10000
1000 10000
1000 10000
10000 1000
10000 1000
10000 1000
100000 100
100000 100
100000 100
1000000 10
1000000 10
1000000 10
------------------------------------------------------
Note that since the product of
nsize*nrpt constant the number of times the line
sum=sum+a[k]*b[k] is executed is constant. Record three
results for each pair of nsize and nrpt values.
In the table Cycle/iter is the number of cycles required to
execute the line sum=sum+a[k]*b[k] and is computed by
you. Inst/cycle is the number of instructions issued per
cycle and is a measure of how well we are using the superscalar
hardware. Again this quantity is computed by you. Use floating point
numbers in your table where applicable.
- In the above table you should find that with increasing
nsize the performance initially improves, but then gets
worse. Explain why this is.
- Earlier you were asked to predict the best performance obtainable for
this loop on the UltraSPARC III. What percentage of this
theoretical peak is i) your fastest result ii) your
slowest result? (You should find there is a big difference between
the best and worse).
We now want to use hardware counters to study cache access
characteristics. You can change the events to be calculated by
setting the
PERFEVENTS environment variable. If you look at
the code you will see that in the absence of this environment variable
it effectively defaults to:
setenv PERFEVENTS pic0=Cycle_cnt,pic1=Instr_cnt
- By setting the relevant performance counters we can gather
information on the level 1 data cache and level 2 or "ecache" hit
and miss rate. State which performance counters you would use?
(Recall from above that cpustat -h gives you a list of
possible counters)
- Complete the following table:
(Modify your shell/awk command from above)
------------------------------------------------------
Level 1 D-Cache Reads
nsize nrpt ref misses hits
------------------------------------------------------
10 1000000
10 1000000
10 1000000
100 100000
100 100000
100 100000
1000 10000
1000 10000
1000 10000
10000 1000
10000 1000
10000 1000
100000 100
100000 100
100000 100
1000000 10
1000000 10
1000000 10
------------------------------------------------------
------------------------------------------------------
Level 2 D-Cache Access
nsize nrpt ref misses hits
------------------------------------------------------
10 1000000
10 1000000
10 1000000
100 100000
100 100000
100 100000
1000 10000
1000 10000
1000 10000
10000 1000
10000 1000
10000 1000
100000 100
100000 100
100000 100
1000000 10
1000000 10
1000000 10
------------------------------------------------------
Cache Line Size
Objective - to show how careful arrangement of data can lead to much
better cache usage and improved performance
Program
align.c defines an array of the following
struct:
struct vector {
int number;
double x0, y0, z0;
double x1, y1, z1;
struct vector *previous;
} ll[NUM];
it then proceeds to measure the cycles required to perform the following
operation
for ( i = 0; i < NUM; i++) {
sum += fabs(ll[i].x0 - ll[i].x1);
}
repeating this loop
LOOPS times and printing out the
minimum and maximum number of cycles used. Look at the code and
make sure you uderstand what it is doing.
Compile align.c by typing make align. Run the code
by typing align. There is no input.
- Run the code several times. (Use the repeat command you used above).
What is the minimum number of cycles required?
- The code contains the line
#pragma align 32 (ll)
what does this line do? (Guess!)
- Add code to printout the address of each of the following
ll[k].number
ll[k].x0 ll[k].y0 ll[k].z0
ll[k].x1 ll[k].y1 ll[k].z1
ll[k].previous
for k=0,1,1023 and 1024. What are these addresses?
- Change the definition of vector to be
struct vector {
int number;
double x0, x1,
double y0, y1;
double z0, z1;
struct vector *previous;
} ll[NUM];
what is the minimum cycle time now? It should now be significantly
less. Explain why this is?
-
- Now change the definition of vector to be
struct vector {
int number;
struct vector *previous;
double x0, x1,
double y0, y1;
double z0, z1;
} ll[NUM];
what is the minimum cycle time now? It should be larger - why?
Cache Associativity
Objective - to see how cache line associativity can directly impact
observed performance
Similar to
align.c program
assoc.c defines an array
of
struct:
struct {
double x1, x2, x3, x4;
} ll[4*NUM];
and then proceeds to measure the cycles required to perform the following
operation
for ( i = 0; i < NUM; i++) {
ll[i].x1 = ll[i].x1 + ll[i+1*stride].x1 + ll[i+2*stride].x1
+ ll[i+3*stride].x1 + ll[i+4*stride].x1;
ll[i].x2 = ll[i].x2 + ll[i+1*stride].x2 + ll[i+2*stride].x2
+ ll[i+3*stride].x2 + ll[i+4*stride].x2;
ll[i].x3 = ll[i].x3 + ll[i+1*stride].x3 + ll[i+2*stride].x3
+ ll[i+3*stride].x3 + ll[i+4*stride].x3;
ll[i].x4 = ll[i].x4 + ll[i+1*stride].x4 + ll[i+2*stride].x4
+ ll[i+3*stride].x4 + ll[i+4*stride].x4;
}
repeating this operation
LOOPS times and printing out the
minimum and maximum number of cycles used. The value of
stride (corrected to be between 1 and NUM-1) is read as
command line input.
Compile assoc.c by typing make assoc. Run the code
by typing assoc stride .
- Complete the following table:
---------------------------------
Minimum Cycles
Stride Exp1 Exp2 Exp3
---------------------------------
1
2
3
1024
2048
4096
---------------------------------
- You should find that the performance decreases markedly for the last 3
cases. Explain why this is the case?
- Perform additional measurements with stides close to 2048, eg
stride=2046,2047,2049,2050... etc. When exactly does the performance
change markedly from good to bad or vice versa?
- You might have expected the above transition only to occur for
stride=2048 or multiples thereof - why? Why does it occur for 1024?
- Recode this loop to remove this problem. (HINT you want to
persuade the compile to move all data from a cache line into
registers before that cache line is evicted!)
If you have a little time...
- Use performance counters in assoc.c to investigate level 1
cache activity and hopefully to confirm the reasons you have
just given for the poor performance!
COMP6464 Only : Stream
Objective - investigate a widely used 3rd party benchmark for
measuring memory bandwidth
Stream is a widely used
memory bandwidth test. Go to the
source code directory
at the Stream web site and download the follow file:
stream.c
- Read the header to the stream.c and try compiling the
code accordingly. Now run the code on iwaki and partch, and
complete the following table. What compiler options did you use
on both machines?
MACHINE 1 - Iwaki
-------------------------------------------------------------
Function Rate (MB/s) RMS time Min time Max time
Copy:
Scale:
Add:
Triad:
MACHINE 2 - Partch
-------------------------------------------------------------
Function Rate (MB/s) RMS time Min time Max time
Copy:
Scale:
Add:
Triad:
- The Copy operation corresponds to:
for (i = 0; i < LENGTH; i++) a[i] = b[i];
write similar expressions for Scale, Add and Triad
- Given you knowledge of computer hardware would you expect
Stream to report any difference in the MB/s rate for each of the
above operations (Copy, Scale, Add, Triad)? Provide justification
for you answer.
- Now recompile the code on Iwaki using the -fast option. Record
new data (it should be better than before).
MACHINE 1 - Iwaki
-------------------------------------------------------------
Function Rate (MB/s) RMS time Min time Max time
Copy:
Scale:
Add:
Triad:
- Based on the MB/s results obtained using the -fast
compiler flag compute MFLOP rates for the Add and Triad results
assuming we are working with 8byte data quantities. Show how you
compute these rates. How do these results compare with the
theoretical peak?
- From the web site work out the following
- The Pentium IV system that gives the fastest single processor
performance. Compare this with your results on partch.
- What stream preformance might you expect for the
APAC national facility SGI system
- The machine that gives the best single processor results