COMP8320 Tutorial 03 -- week 4, 2011
Performance Issues
Please read the article mentioned below before the tutorial.
If you have any unresolved questions or things you would like explained
from earlier in the course, please have these ready and let your tutor
know at the start of the session. This can include anything about
Assignment 1!
Afterwards, discuss in small groups the following questions:
- From your reading of the article:
- What performance issues do multicore processors have in particular
(that are less likely to be a problem on a traditional multiprocessor)?
Give at least two examples.
- What is the problem with most current performance analysis tools?
- Considering that it can work on unmodified executable programs,
what hardware mechanism(s) do you think the fingerprint analysis
technique could be using?
- The program dsum 10000 completed a sum of its vector in
0.000124s with only a single thread. Calculate the
corresponding floating point operation rate.
Now consider the optimized assembler code for the loop below.
Perform a `cycle count' of the loop for single strand performance.
To do this, annotate each instruction in the code below with the cycle
you would expect it to start on, beginning with cycle 1 for the first.
Subsequent instructions must start at least 1 cycle later; they may start
later still if a load-to-floating-point dependency (4 cycles) or a
floating-point-to-floating-point dependency (7 cycles) is not yet satisfied.
There is a 5-cycle penalty for a taken branch and for a prefetch
instruction.
Compare this with the actual performance; what is the most likely
cause for the discrepancy? How could you confirm this?
Hint: when the program was modified to do a `warm-up'
summation loop before the timing loop, it computed a summation of a vector
of length 1000 in 0.000004s. What rate does this correspond to?
(A sketch of such a warm-up timing harness is given after the listing below.)
.L900000305:
prefetch [%i4+304],0
add %i3,12,%i3
ldd [%i4], %f6
add %i4,96,%i4
faddd %f26,%f12, %f12
faddd %f2,%f14, %f14
cmp %i3,%i5
ldd [%i4-88], %f4
faddd %f8,%f16, %f16
ldd [%i4-80], %f8
faddd %f10,%f18, %f18
ldd [%i4-72], %f30
faddd %f12,%f20, %f2
ldd [%i4-64], %f12
faddd %f14,%f22, %f22
ldd [%i4-56], %f14
faddd %f16,%f24, %f24
ldd [%i4-48], %f16
faddd %f18,%f0, %f0
ldd [%i4-40], %f18
faddd %f2,%f6, %f26
ldd [%i4-32], %f20
faddd %f22,%f4, %f2
ldd [%i4-24], %f22
faddd %f24,%f8, %f8
ldd [%i4-16], %f24
faddd %f0,%f30, %f10
ble,pt %icc,.L900000305
ldd [%i4-8], %f0
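
To make the hint above concrete, here is a minimal sketch of a timing
harness with a warm-up pass. It is not the actual dsum source: the dsum()
function, the vector contents and the use of gettimeofday() are
assumptions made purely for illustration. The final line shows the rate
calculation asked for above (one floating point addition per element).

/* Sketch only -- not the actual dsum program.  Illustrates a warm-up pass
 * before the timed loop and the floating point rate calculation. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* hypothetical summation kernel: one addition per element */
double dsum(const double *v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

static double now(void) {              /* wall-clock time in seconds */
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void) {
    int n = 10000;
    double *v = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) v[i] = 1.0;

    volatile double s = dsum(v, n);    /* warm-up: brings v into the cache */

    double t0 = now();
    s = dsum(v, n);                    /* timed run */
    double t = now() - t0;

    /* n floating point additions in t seconds */
    printf("sum=%g  time=%gs  rate=%g MFLOP/s\n", s, t, n / t / 1e6);
    free(v);
    return 0;
}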
- Consider the ticket lock algorithm discussed in the week 4 lecture.
How would you implement it on an architecture that had (i) an atomic
increment instruction, or (ii) an atomic compare-and-exchange function
such as the following? (A rough sketch of both variants is given after
the note below.)
unsigned cmpxchg(volatile unsigned *x, unsigned v_old, unsigned v_new)
// pre:  *x holds some value v
// post: if (v == v_old) then *x = v_new (performed atomically)
// returns v (the value *x held before the call)
Note that such a routine would be implemented using an atomic
compare-and-swap instruction (e.g. cas [x],r1,r2, which swaps memory
location x with register r2 if the contents of location x are the same
as the contents of r1).
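
As a starting point only (not necessarily the intended answer), the sketch
below shows in C how the two variants might look. atomic_inc() is a
hypothetical wrapper for an atomic (fetch-and-)increment instruction and is
assumed to return the value held before the increment; cmpxchg() has the
interface given above. Memory ordering and back-off are ignored.

/* Sketch only: a ticket lock built on the two primitives in the question. */

/* primitives assumed by the sketch (not defined here) */
unsigned atomic_inc(volatile unsigned *x);   /* hypothetical: atomically increments
                                                *x, returns the old value */
unsigned cmpxchg(volatile unsigned *x, unsigned v_old, unsigned v_new);

typedef struct {
    volatile unsigned next_ticket;   /* next ticket to hand out           */
    volatile unsigned now_serving;   /* ticket currently allowed to enter */
} ticket_lock_t;

/* (i) with an atomic increment instruction */
void acquire_inc(ticket_lock_t *l) {
    unsigned my = atomic_inc(&l->next_ticket);   /* take a ticket */
    while (l->now_serving != my)
        ;                                        /* spin until it is called */
}

/* (ii) with the compare-and-exchange routine */
void acquire_cas(ticket_lock_t *l) {
    unsigned my;
    for (;;) {                                   /* emulate fetch-and-increment */
        my = l->next_ticket;
        if (cmpxchg(&l->next_ticket, my, my + 1) == my)
            break;                               /* our increment succeeded */
    }
    while (l->now_serving != my)
        ;                                        /* spin until it is called */
}

void release(ticket_lock_t *l) {
    l->now_serving++;   /* only the lock holder writes this, so no atomic needed */
}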
Last modified: 18/08/2011, 12:34