COMP8320
Laboratory 04 - week 4, 2011
Locks and Barriers
This session evaluates some synchronization algorithms, introduces tools to
analyze them, and examines the effects of thread placement. Please seek your
tutor's attention if there is anything you
don't understand (or think is worth commenting on!).
Copy the files for the session into
your home directory:
cp -pr /dept/dcs/comp8320/public/lab04/ .
Locks
Now cd scalable_lock and type the command make.
Briefly inspect the files and get an idea of what is implemented where.
The program locktest can evaluate 7 different kinds of locks:
0: ticket, 1: MCS, 2: NMCS, 3: CLH, 4: simple, 5: optimized simple,
6: optimized ticket.
Do the following exercises.
- The `simple lock' is a non-scalable lock based on an atomic register-memory
swap instruction (see the definition of __xchg() in
locks.h). It is implemented in simple_locks.c.
An optimization is to attempt the atomic instruction only once the lock
has been observed to become zero. In opt_simple_lock(), implement this
optimization; a sketch of the general idea is given after this exercise.
Test it with ./locktest p 1 1 5 for p = 1, 2, 4, ...
(if it `hangs', you have a deadlock bug!)
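Below is a minimal sketch of this `test-and-test-and-set' idea. It is not
the lab's code: it uses the portable GCC builtins __sync_lock_test_and_set()
and __sync_lock_release() in place of __xchg(), and the type and function
names are illustrative only.

  /* Sketch of a test-and-test-and-set lock (illustrative names;
   * not the exact interface required by simple_locks.c). */
  typedef struct { volatile int held; } tts_lock_t;   /* 0 = free, 1 = held */

  static void tts_lock(tts_lock_t *l)
  {
      for (;;) {
          while (l->held)
              ;                                   /* spin on ordinary loads only */
          /* the lock looks free: now try the (expensive) atomic swap */
          if (__sync_lock_test_and_set(&l->held, 1) == 0)
              return;                             /* swapped in a 1; lock acquired */
      }
  }

  static void tts_unlock(tts_lock_t *l)
  {
      __sync_lock_release(&l->held);              /* store 0 with release semantics */
  }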
- The `optimized ticket lock' is a version of the ticket lock (algorithm 0),
as in the Linux kernel implementation, which avoids atomic
operations on lock release. It uses two counters, tkt
and screen. Implement this lock in the
opt_ticket_*() functions in simple_locks.c;
a sketch of the underlying idea is given after this exercise.
Note that you will need to access the ticket and screen
values through local pointer variables, which must be
volatile. Test your code with ./locktest p 1 1
6 for p = 1, 2, 4, ....
If you get stuck on this, go on with the exercises and ask your
demonstrator later (even in the next lab).
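As a rough guide, the sketch below shows the two-counter idea. It is
illustrative only: it uses the GCC builtin __sync_fetch_and_add() for the
atomic ticket increment, and it declares the fields volatile rather than
using the volatile local pointers mentioned above; the struct and function
names are not those of simple_locks.c.

  /* Sketch of a ticket lock whose release needs no atomic operation. */
  typedef struct {
      volatile unsigned tkt;      /* next ticket to be handed out */
      volatile unsigned screen;   /* ticket number currently being served */
  } ticket_lock_t;                /* both fields start at 0 */

  static void ticket_lock_sketch(ticket_lock_t *l)
  {
      /* atomically take a ticket (fetch-and-increment) */
      unsigned my_ticket = __sync_fetch_and_add(&l->tkt, 1);
      /* wait until the screen shows our number: ordinary loads only */
      while (l->screen != my_ticket)
          ;
  }

  static void ticket_unlock_sketch(ticket_lock_t *l)
  {
      /* only the holder writes screen, so a plain store suffices;
       * no atomic operation is needed on release */
      l->screen = l->screen + 1;
  }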
-
Execute the command:
for p in 1 2 4 8 16 32; do
./locktest $p 1000 5 0
done
This will run the ticket lock (lock 0) 1000 times and take 5
timings, for 1 to 32 threads. The median time should be regarded as the
most accurate. Repeat the above for the 6 other kinds of locks and
tabulate your results in a text file.
What trends do you observe? Which are the most scalable?
Have your optimizations been effective?
-
We will now use the cputrack tool to try to understand the
performance differences. You will probably notice that the most
interesting differences occur with 32 threads.
Using:
cputrack -c Idle_strands,Instr_cnt -c DC_miss,Atomics -t ./locktest 32 1000000 1 0
Record the associated total cycles and event counts (upon
exit) for the ticket lock (you will notice that it prints out
the statistics on every system call, but it is only the final total that
is of interest). Also add the timing from the locktest program.
Repeat for the other locks and tabulate your results.
Analyze the data; in particular, why is the ticket lock slow, and what
effect has the spin loop had in the optimized simple lock? To what
extent does Idle_strands + Instr_cnt account for the performance?
Under heavy contention, roughly how many attempts does an atomic add seem
to need? Would you expect the behavior to be different if the
hardware natively supported an atomic add, as on the x86?
(A sketch of how an atomic add is emulated with compare-and-swap is
given after this exercise.)
We have used cputrack in sampling mode by
specifying multiple event sets. For this reason, we increased the
iteration count. However, this introduces some distortion. To get a
feel for this, try running one of the above commands without
cputrack.
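The question about `attempts' arises because on hardware with no native
atomic add instruction (such as SPARC, which provides compare-and-swap
instead), an atomic add is emulated by a compare-and-swap retry loop;
under contention the CAS can fail and must be retried. A minimal sketch,
using the GCC builtin __sync_val_compare_and_swap() (illustrative, not
the library's actual code):

  static int fetch_and_add_sketch(volatile int *addr, int inc)
  {
      for (;;) {
          int old = *addr;                 /* read the current value */
          /* the CAS succeeds (and returns old) only if *addr is still old */
          if (__sync_val_compare_and_swap(addr, old, old + inc) == old)
              return old;                  /* our increment took effect */
          /* otherwise another thread updated the word first: retry */
      }
  }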
-
An optional extra parameter to locktest places threads
on particular cores (see placethread.h; a sketch of the kind of
binding call this might use is given after this exercise). The command:
cputrack -c DC_miss,Idle_strands ./locktest 8 1000000 1 0 1
places the 8 threads on 1 core. Repeat this for 2 and 4 cores. Tabulate
the results for total execution time and the two events. What trends do
you observe? What is the tradeoff in this case for using more cores?
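For reference, thread-to-core binding on Solaris can be done with the
processor_bind() system call, as in the sketch below. This is only an
assumption about the kind of call placethread.h might make; inspect
placethread.h itself to see what locktest actually does.

  #include <sys/types.h>
  #include <sys/processor.h>
  #include <sys/procset.h>
  #include <stdio.h>

  /* Bind the calling thread (LWP) to the given virtual processor.
   * Illustrative only; not necessarily what placethread.h does. */
  static void bind_self_to_cpu(processorid_t cpu)
  {
      if (processor_bind(P_LWPID, P_MYID, cpu, NULL) != 0)
          perror("processor_bind");
  }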
Barriers
Now cd ../barriers and type the command make.
Briefly inspect the files and get an idea of what is implemented where.
The program test_barriers can evaluate 3 different kinds of barriers:
0: central, 1: (combining) tree, 2: scalable
The central barrier is not scalable, in the sense that its complexity is
O(p), where p is the number of threads; a sketch of a typical central
(sense-reversing) barrier is given below.
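As a rough illustration only (not necessarily how barrier 0 is implemented
in these files), a sense-reversing central barrier looks like this; all p
threads update a single shared counter, which is why the cost grows with p:

  typedef struct {
      volatile int count;    /* threads yet to arrive; initialized to nthreads */
      volatile int sense;    /* flips each time the barrier completes */
      int nthreads;
  } central_barrier_t;

  /* local_sense points to per-thread state, initially 0 */
  static void central_barrier_sketch(central_barrier_t *b, int *local_sense)
  {
      *local_sense = !*local_sense;                  /* sense expected this episode */
      if (__sync_sub_and_fetch(&b->count, 1) == 0) { /* last thread to arrive */
          b->count = b->nthreads;                    /* reset for the next episode */
          b->sense = *local_sense;                   /* release the waiting threads */
      } else {
          while (b->sense != *local_sense)           /* spin on the shared flag */
              ;
      }
  }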
-
Execute the command:
for p in 1 2 4 8 16 32; do
./test_barriers $p 1000 10 0
done
This will run the central barrier (barrier 0) 1000 times and take 10
timings, for 1 to 32 threads. The median time should be regarded as the
most accurate; you will notice that some timings are unstable.
Repeat the above for the 2 other kinds of barriers and
tabulate your results in a text file.
-
Again, we will use cputrack to try to understand the
performance differences.
Using separate rather than multiplexed counting of events:
cputrack -t -c DC_miss,Atomics -T 10 ./test_barriers 32 100000 1 0
cputrack -t -c Idle_strands,Instr_cnt -T 10 ./test_barriers 32 100000 1 0
record the associated event counts (upon exit) for the
central barrier. Repeat for the other barriers and tabulate your
results.
Analyze the data; in particular, what tradeoffs are there between
the tree and scalable barriers?
The -T 10 option reduces the frequency at which statistics are printed
on timer interrupts, from the default of every 1 s to every 10 s.
-
Finally, we will look at the effects of thread placement on barriers.
Execute:
cputrack -c DC_miss,Idle_strands -T 10 ./test_barriers 8 100000 1 0 1
and repeat this for 2 and 4 cores. Tabulate
the results for total execution time and the two events. What trends do
you observe? Is it different from the case of locks?
-
As an exercise in the use of the collect profile tool,
execute the commands:
mkdir /tmp/$USER
collect -h DC_miss,on,Idle_strands,on -p on -d /tmp/$USER ./test_barriers 8 1000000 1 0 1
Run the command ./test_barriers 8 1000000 1 0 1 alone to see what
distortion this introduces into the run-time. To analyze the data, use:
analyzer /tmp/$USER/test.1.er &
(this will take some time to come up...).
From the View menu, select Set
Data Presentation and Sort by Exclusive User Time. Now look at
the Functions tab. You should see a list of the functions in
order of exclusive user time.
Alternatively, for a faster, text-friendly version, run the command:
er_print /tmp/$USER/test.1.er
Useful sub-commands include
statistics
metrics
sort e.user
fsummary
help
quit
Here we sort the data on the exclusive user time for each function,
that is, the time spent executing instructions from that function.
See on-line man pages for collect and er_print
for more details.
Homework!
You could try some experiments of your own; for example, running the
above with up to 56 threads (which we probably don't want everyone doing
at once during a lab session!).
Revise again the lecture segment on locks and barriers to understand the
algorithms. You might also find reading the source code useful. Try to
reconcile the results with your understanding. Also compare your results
with those for the small-scale x86 SMP that were presented. How important is it to
use scalable synchronization algorithms for highly
multithreaded/multicore processors?
Last modified: 18/08/2011, 12:41