The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science


COMP8320 Tutorial 03 -- week 4, 2011

Performance Issues

Please read the article mentioned below before the tutorial. If you have any unresolved questions, or things you would like explained from earlier in the course, please have these ready and let your tutor know at the start of the session. This can include anything about Assignment 1!

Afterwards, discuss in small groups the following questions:

  1. From your reading of the article:

    1. What performance issues do multicore processors have in particular (that are less likely to be a problem on a traditional multiprocessor)? Give at least two reasons.

    2. What is the problem with most current performance analysis tools?

    3. Considering that it can work on unmodified executable programs, what hardware mechanism(s) do you think the fingerprint analysis technique could be using?

  2. The program dsum 10000 summed its vector in 0.000124 s using only a single thread. Calculate the corresponding floating point operation rate.

    Now consider the optimized assembler code for the loop below. Perform a `cycle count' of the loop for single strand performance. To do this, annotate each instruction in the code below with the cycle you would expect it to start on, beginning with cycle 1 for the first instruction. Each subsequent instruction must start at least 1 cycle later; it can start later still if a load/floating point dependency (4 cycles) or a floating point/floating point dependency (7 cycles) is not yet satisfied. There is a 5 cycle penalty for a taken branch and for a prefetch instruction. Note that underscores have been added to tidy up HTML rendering. Compare this with the actual performance; what is the most likely cause of the discrepancy? How could you confirm this?

    Hint: when the program was modified to do a `warm-up' summation loop before the timing loop, it computed the summation of a vector of length 1000 in 0.000004 s. What rate does this correspond to?

    .L900000305:
    prefetch [%i4+304],0
    add_     %i3,12,%i3   
    ldd_     [%i4], %f6    
    add_     %i4,96,%i4   
    faddd   %f26,%f12, %f12
    faddd   %f2,%f14, %f14
    cmp_     %i3,%i5
    ldd_     [%i4-88], %f4 
    faddd   %f8,%f16, %f16 
    ldd_     [%i4-80], %f8
    faddd   %f10,%f18, %f18
    ldd_     [%i4-72], %f30
    faddd   %f12,%f20, %f2
    ldd_     [%i4-64], %f12
    faddd   %f14,%f22, %f22
    ldd_     [%i4-56], %f14
    faddd   %f16,%f24, %f24
    ldd_     [%i4-48], %f16
    faddd   %f18,%f0, %f0
    ldd_     [%i4-40], %f18
    faddd   %f2,%f6, %f26 
    ldd_     [%i4-32], %f20
    faddd   %f22,%f4, %f2 
    ldd_     [%i4-24], %f22
    faddd   %f24,%f8, %f8 
    ldd_     [%i4-16],%f24
    faddd   %f0,%f30, %f10 
    ble,pt  %icc,.L900000305
    ldd_     [%i4-8], %f0  
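
As a starting point for the rate calculations in this question, the arithmetic can be sketched in C. This is a minimal sketch that assumes the summation performs one floating point add per vector element and that the argument to dsum is the vector length; the function name `sum_mflops` is hypothetical and not part of the dsum program.

```c
#include <assert.h>

/* Floating point operation rate in MFLOPS for summing n elements.
   Assumption: one floating point add per element (hypothetical helper). */
double sum_mflops(unsigned long n, double seconds)
{
    assert(seconds > 0.0);               /* a zero time would give no rate */
    return (double)n / seconds / 1.0e6;  /* operations per second, in millions */
}
```

Under that assumption, sum_mflops(10000, 0.000124) comes to roughly 80.6 and sum_mflops(1000, 0.000004) to 250, which is the gap the hint points at.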
    

  3. Consider the ticket lock algorithm discussed in the week 4 lecture. How would you implement it on an architecture that had (i) an atomic increment instruction, or (ii) an atomic compare-and-exchange function:
    unsigned cmpxchg(volatile unsigned *x, unsigned v_old, unsigned v_new)
    // pre:  *x = v
    // post: if (v == v_old) *x = v_new (atomic)
    //       returns v     
    
    Note that such a routine would be implemented using an atomic compare-and-swap instruction (e.g. cas [x],r1,r2 swaps the contents of memory location x with register r2 if the contents of x are the same as r1).
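
One possible shape for an answer is sketched below in C. It is not the lecture's code. It assumes a compiler that provides the GCC/Clang `__sync` builtins, which stand in here for the hardware instructions: `__sync_val_compare_and_swap` plays the role of cmpxchg and `__sync_fetch_and_add` the role of the atomic increment. The names `ticket_lock_t`, `fetch_inc_cas`, `fetch_inc_atomic`, `ticket_acquire` and `ticket_release` are hypothetical.

```c
#include <assert.h>

/* Stand-in for the cmpxchg routine in the question (assumption: built on a
   GCC/Clang builtin). Returns the value *x held before the call. */
unsigned cmpxchg(volatile unsigned *x, unsigned v_old, unsigned v_new)
{
    return __sync_val_compare_and_swap(x, v_old, v_new);
}

typedef struct {
    volatile unsigned next_ticket;  /* next ticket to hand out */
    volatile unsigned now_serving;  /* ticket currently allowed in */
} ticket_lock_t;

/* (i) Take a ticket with an atomic increment; returns the old value. */
unsigned fetch_inc_atomic(volatile unsigned *x)
{
    return __sync_fetch_and_add(x, 1u);
}

/* (ii) Take a ticket with cmpxchg: retry until no other thread has
   changed *x between our read and our exchange. */
unsigned fetch_inc_cas(volatile unsigned *x)
{
    unsigned seen;
    do {
        seen = *x;
    } while (cmpxchg(x, seen, seen + 1u) != seen);
    return seen;
}

/* Acquire: take a ticket, then spin until it is being served. */
unsigned ticket_acquire(ticket_lock_t *l)
{
    unsigned my = fetch_inc_cas(&l->next_ticket);
    assert(my + 1u != 0u || 1);     /* tickets simply wrap around on overflow */
    while (l->now_serving != my) {
        /* spin */
    }
    __sync_synchronize();           /* keep critical-section accesses after the wait */
    return my;
}

/* Release: only the lock holder writes now_serving, so a plain store is
   enough once earlier accesses have been fenced. */
void ticket_release(ticket_lock_t *l)
{
    __sync_synchronize();
    l->now_serving = l->now_serving + 1u;
}
```

For variant (i), `ticket_acquire` would call `fetch_inc_atomic` in place of `fetch_inc_cas`; the rest of the lock is unchanged. A lock would be initialised with both fields zero, for example `ticket_lock_t l = {0, 0};`.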

Last modified: 18/08/2011, 12:34
