The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science

COMP8320 Laboratory 04 - week 4, 2011

Locks and Barriers


This session evaluates several synchronization algorithms, introduces some tools for analyzing them, and examines the effects of thread placement. Please seek your tutor's attention if there is anything you don't understand (or think is worth commenting on!).

Copy the files for the session into your home directory:

    cp -pr /dept/dcs/comp8320/public/lab04/ .

Locks

Now cd scalable_lock and type the command make. Briefly inspect the files and get an idea of what is implemented where. The program locktest can evaluate 7 different kinds of locks:
    0: ticket, 1: MCS, 2: NMCS, 3: CLH, 4: simple, 5: optimized simple, 6: optimized ticket
Do the following exercises.
  1. The `simple lock' is a non-scalable lock based on an atomic swap (register-memory) instruction (see the definition of __xchg() in locks.h). It is implemented in simple_locks.c. An optimization is to only attempt the atomic instruction when the lock becomes zero. In opt_simple_lock(), implement this optimization (a rough sketch of the idea is given after these exercises). Test it with ./locktest p 1 1 5 for p = 1, 2, 4, ... (if it `hangs', you have a deadlock bug!)

  2. The `optimized ticket lock' is a version of the Linux implementation of the ticket lock (algorithm 0) which avoids atomic operations upon lock release. It uses two counters, tkt and screen. Implement this lock in the opt_ticket_*() functions in simple_locks.c (see the sketch after these exercises). Note that you will need to access the ticket and screen values through local pointer variables, which must be volatile. Test your code with ./locktest p 1 1 6 for p = 1, 2, 4, ....
    If you get stuck on this, continue with the remaining exercises and ask your demonstrator later (even in the next lab).

  3. Execute the command:
      for p in 1 2 4 8 16 32; do
      ./locktest $p 1000 5 0
      done
    This will run the ticket lock (lock 0) 1000 times and take 5 timings, for 1 to 32 threads. The median time should be regarded as the most accurate. Repeat the above for the 6 other kinds of locks and tabulate your results in a text file.

    What trends do you observe? Which are the most scalable? Have your optimizations been effective?

  4. We will now use the cputrack tool to try to understand the performance differences. You will probably notice that the most interesting differences occur with 32 threads. Using:
      cputrack -c Idle_strands,Instr_cnt -c DC_miss,Atomics -t ./locktest 32 1000000 1 0
    Record the associated total cycles and event counts (upon exit) for the ticket lock (you will notice that it prints out the statistics on every system call, but it is only the final total that is of interest). Also record the timing reported by the locktest program. Repeat for the other locks and tabulate your results.

    Analyze the data; in particular, why is the ticket lock slow, and what effect has the spin loop had in the optimized simple lock? To what extent does Idle_strands + Instr_cnt account for performance? Under heavy contention, roughly how many attempts at an atomic add seem to be needed? Would you expect the behavior to be different if the hardware natively supported an atomic add, such as on the x86?

    We have used cputrack in sampling mode by specifying multiple event sets; for this reason, we increased the iteration count. However, this introduces some distortion. To get a feeling for this, try running one of the above commands without cputrack.

  5. An optional extra parameter to locktest places threads on particular cores (see placethread.h). The command:
      cputrack -c DC_miss,Idle_strands ./locktest 8 1000000 1 0 1
    places the 8 threads on 1 core. Repeat this for 2 and 4 cores. Tabulate the results for total execution time and the two events. What trends do you observe? What is the tradeoff in this case for using more cores?
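
For exercise 1, the following is a minimal sketch of the test-and-test-and-set idea behind opt_simple_lock(). The lock representation, the signature of __xchg() and the unlock function name below are assumptions, not the lab's actual code; check locks.h and simple_locks.c for the real definitions.

    /* Sketch only: the lock type, the __xchg() signature and the unlock
     * name are assumed here. */
    typedef volatile int simple_lock_t;      /* 0 = free, 1 = held (assumed) */

    /* assumed: atomically stores val at *addr and returns the old value */
    extern int __xchg(volatile int *addr, int val);

    void opt_simple_lock(simple_lock_t *lock)
    {
        for (;;) {
            while (*lock != 0)               /* spin on ordinary reads;     */
                ;                            /* no atomic traffic here      */
            if (__xchg(lock, 1) == 0)        /* lock looked free: try the   */
                return;                      /* atomic swap; old value 0    */
                                             /* means we won it             */
            /* someone else got it first; go back to spinning */
        }
    }

    void opt_simple_unlock(simple_lock_t *lock)
    {
        *lock = 0;                           /* a plain store releases it   */
    }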
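
Similarly, here is a minimal sketch of the optimized ticket lock for exercise 2. The structure layout and the name of the atomic fetch-and-add primitive (__fetch_and_add() below) are assumptions; adapt them to what locks.h actually provides.

    /* Sketch only: the struct layout and __fetch_and_add() are assumed. */
    typedef struct {
        volatile unsigned tkt;               /* next ticket to hand out     */
        volatile unsigned screen;            /* ticket currently served     */
    } opt_ticket_lock_t;

    /* assumed: atomically adds val to *addr and returns the old value */
    extern unsigned __fetch_and_add(volatile unsigned *addr, unsigned val);

    void opt_ticket_lock(opt_ticket_lock_t *l)
    {
        volatile unsigned *screen = &l->screen;           /* local volatile */
        unsigned my_ticket = __fetch_and_add(&l->tkt, 1); /* take a ticket  */
        while (*screen != my_ticket)
            ;                                             /* wait our turn  */
    }

    void opt_ticket_unlock(opt_ticket_lock_t *l)
    {
        volatile unsigned *screen = &l->screen;
        *screen = *screen + 1;    /* plain increment: no atomic is needed,
                                     since only the lock holder ever
                                     updates screen                         */
    }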

Barriers

Now cd ../barriers and type the command make. Briefly inspect the files and get an idea of what is implemented where. The program test_barriers can evaluate 3 different kinds of barriers:
    0: central, 1: (combining) tree, 2: scalable
The central barrier is not scalable in the sense that its complexity is O(p), where p is the number of threads.
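
As a rough illustration of why its cost grows with p, here is a minimal sketch of a central, counter-based, sense-reversing barrier: every arriving thread updates the single shared counter, and all but the last then spin on the single shared flag. The names and the fetch-and-add primitive are assumptions; the code in the barriers directory may be organized differently.

    /* Sketch only: names and __fetch_and_add() are assumed. count must be
     * initialized to nthreads, and each thread's local_sense to the initial
     * value of sense. */
    typedef struct {
        volatile int count;                  /* threads still to arrive     */
        volatile int sense;                  /* flips when the barrier opens*/
        int nthreads;
    } central_barrier_t;

    /* assumed: atomically adds val to *addr and returns the old value */
    extern int __fetch_and_add(volatile int *addr, int val);

    void central_barrier(central_barrier_t *b, int *local_sense)
    {
        *local_sense = !*local_sense;              /* per-thread sense flag */
        if (__fetch_and_add(&b->count, -1) == 1) { /* last thread to arrive */
            b->count = b->nthreads;                /* reset for next use    */
            b->sense = *local_sense;               /* release the others    */
        } else {
            /* the other p-1 threads all spin on the one shared flag */
            while (b->sense != *local_sense)
                ;
        }
    }
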
  1. Execute the command:
      for p in 1 2 4 8 16 32; do
      ./test_barriers $p 1000 10 0
      done
    This will run the central barrier (barrier 0) 1000 times and take 10 timings, for 1 to 32 threads. The median time should be regarded as the most accurate; you will notice that some timings are unstable. Repeat the above for the 2 other kinds of barriers and tabulate your results in a text file.

  2. Again, we will use cputrack to try to understand the performance differences. Using separate rather than multiplexed counting of events:
      cputrack -t -c DC_miss,Atomics -T 10 ./test_barriers 32 100000 1 0
      cputrack -t -c Idle_strands,Instr_cnt -T 10 ./test_barriers 32 100000 1 0
    record the associated event counts (upon exit) for the central barrier. Repeat for the other barriers and tabulate your results.

    Analyze the data; in particular, what tradeoffs are there between the tree and scalable barriers?

    The -T 10 option suppresses the printing of statistics upon timer interrupts by extending the reporting interval from the default of 1s to 10s.

  3. Finally, we will look at the effects of thread placement on barriers. Execute:
      cputrack -c DC_miss,Idle_strands -T 10 ./test_barriers 8 100000 1 0 1
    and repeat this for 2 and 4 cores. Tabulate the results for total execution time and the two events. What trends do you observe? Is it different from the case of locks?

  4. As an exercise in the use of the collect profile tool, execute the commands:
      mkdir /tmp/$USER
      collect -h DC_miss,on,Idle_strands,on -p on -d /tmp/$USER ./test_barriers 8 1000000 1 0 1
    Run the command ./test_barriers 8 1000000 1 0 1 alone to see what distortion this introduces into the run time. To analyze the data, use:
      analyzer /tmp/$USER/test.1.er &
    (this will take some time to come up...). From the View menu, select Set Data Presentation and Sort by Exclusive User Time. Now look at the Functions tab. You should see a list of the functions in order of exclusive user time.

    Alternatively, for a faster, text-friendly version, run the command:

      er_print /tmp/$USER/test.1.er
    Useful sub-commands include
      statistics
      metrics
      sort e.user
      fsummary
      help
      quit
    Here we sort the data on the exclusive user time for each function, that is, the time spent executing instructions from that function.

    See on-line man pages for collect and er_print for more details.

Homework!

You could try some experiments of your own; for example, running the above with up to 56 threads (which we probably don't want everyone doing at once during a lab session!).

Revise the lecture segment on locks and barriers again to understand the algorithms. You might also find it useful to read the source code. Try to reconcile the results with your understanding. Also compare your results with the small-scale x86 SMP results that were presented. How important is it to use scalable synchronization algorithms for highly multithreaded/multicore processors?

Last modified: 18/08/2011, 12:41
