The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science

COMP8320 Laboratory 02 - week 2, 2011

OpenMP on Solaris


In this session, we will look at how OpenMP is implemented on SPARC/Solaris. It will also serve as a catch-up from the previous session.

In this session, you will encounter issues, some quite deep, relating to multicore programming and performance. These notes will raise questions about them as they come up. It's good to think about them, but as time is limited, quickly check your understanding with the demonstrator and then move on.

Logging in and Setting Up

Log in to wallaman. You should see a prompt like:
    u8914893@wallaman:~>
If not, your account may not be set up properly for the T2. See the instructions for Laboratory 1; or get help from your demonstrator. Go to the same directory that you used for Lab 01.

Complete Laboratory 01

If you have not done so, complete the previous week's lab.

OpenMP: How the atomic directive is implemented

Recall that we protected the update of the variable sum in dsum_omp_atomic.c by adding the line:
    #pragma omp atomic
Check that the directive is still there and recompile:
    cc -fast -xopenmp -o dsum_omp_atomic dsum_omp_atomic.c
and verify that the program works correctly (albeit slowly). Compare with the original dsum_omp program:
    ./threadrun 32 ./dsum_omp 100000
    ./threadrun 32 ./dsum_omp_atomic 100000
If you ask the compiler to produce an assembly listing of its compilation:
    cc -fast -xopenmp -S dsum_omp_atomic.c
and then search for the keyword `atomic' in dsum_omp_atomic.s, you will see how the OpenMP atomic directive is implemented by the OpenMP runtime library (mt). What do you think the associated function calls do? (Recall the shared memory slide from Lecture 1.) The atomic directive is, however, efficient in cases where it is called less frequently.
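
For reference, the protected update presumably looks something like the following minimal sketch (the array and variable names are illustrative assumptions, not necessarily those used in dsum_omp_atomic.c):
    /* Minimal sketch of an atomic-protected sum; an assumption about the
     * shape of dsum_omp_atomic.c, not a copy of it. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 100000;
        double *a = malloc(n * sizeof(double));
        double sum = 0.0;
        int i;

        #pragma omp parallel for
        for (i = 0; i < n; i++)
            a[i] = 1.0;

        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            #pragma omp atomic     /* each update of sum is protected via the mt runtime */
            sum += a[i];
        }

        printf("sum = %f\n", sum);
        free(a);
        return 0;
    }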

OpenMP: how loops are parallelized

Ask the compiler to produce an assembly listing of its compilation of dsum_omp.c:
    cc -fast -xopenmp -S dsum_omp.c
Load dsum_omp.s into an editor and go to the bottom of the file. Search backward for the faddd (Floating-point ADD Double) instruction and look for a (heavily unrolled) loop, which also contains a lot of ldd (LoaD Double) instructions. Note how aggressively the compiler optimizes the loop. Scroll upwards, and you will see that this loop is actually part of a function (subroutine) called something like _$d1B30.main. Near the entry of the function, you will see that it calls a system function __mt_get_next_chunk...(), which tells it which iterations to work on.

Now locate the entry point of main() (near the top of the file). Search for consecutive call instructions. You will see a call to the master function __mt_MasterFunction_rtc_(); go past this until you find the second one. This one is for the second loop; you will see (a number of instructions earlier) that the address of _$d1B30.main is being placed on the stack for this function to use.

So how does this work? The first call to the master function creates the threads and sets them to execute the function for the first parallel loop. The threads idle between this and the second call, which wakes them up to execute the function for the second loop.
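
The following is a conceptual sketch of this scheme, written with POSIX threads. It is not the compiler's actual output or the libmtsk API: get_next_chunk() and outlined_loop() are stand-ins for __mt_get_next_chunk...() and the outlined function (e.g. _$d1B30.main).
    /* Conceptual sketch only: an outlined loop body plus a chunk dispenser,
     * of the kind the OpenMP compiler and runtime generate behind the scenes.
     * Compile with your compiler's pthreads option (e.g. gcc -pthread). */
    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4
    #define CHUNK    10000

    static double a[N];
    static double sum = 0.0;
    static long   next_iter = 0;        /* next unclaimed iteration */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* stand-in for __mt_get_next_chunk...(): hand out [lo, hi) ranges */
    static int get_next_chunk(long *lo, long *hi)
    {
        pthread_mutex_lock(&m);
        *lo = next_iter;
        next_iter += CHUNK;
        pthread_mutex_unlock(&m);
        if (*lo >= N)
            return 0;                   /* nothing left to do */
        *hi = (*lo + CHUNK < N) ? *lo + CHUNK : N;
        return 1;
    }

    /* stand-in for the outlined function */
    static void *outlined_loop(void *arg)
    {
        long lo, hi, i;
        double s = 0.0;
        (void)arg;
        while (get_next_chunk(&lo, &hi))
            for (i = lo; i < hi; i++)
                s += a[i];
        pthread_mutex_lock(&m);         /* combine this thread's partial sum */
        sum += s;
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        long i;
        for (i = 0; i < N; i++)
            a[i] = 1.0;
        for (i = 0; i < NTHREADS; i++)  /* the "master" creates the worker threads */
            pthread_create(&t[i], NULL, outlined_loop, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("sum = %g\n", sum);
        return 0;
    }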

You can verify that the first call to the master function creates the threads, and determine the overhead of thread creation, by removing the first #pragma omp and seeing how that affects the execution time of the second loop.
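
If dsum_omp.c does not already time its loops separately, a rough way to do so is sketched below using omp_get_wtime() (the loop structure is an assumption about the file's layout):
    /* Sketch: timing the two parallel loops separately with omp_get_wtime().
     * Removing the first "#pragma omp parallel for" should shift the thread
     * creation cost into the time reported for the second loop. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N];

    int main(void)
    {
        double t0, t1, t2, sum = 0.0;
        int i;

        t0 = omp_get_wtime();
        #pragma omp parallel for                    /* threads are created here */
        for (i = 0; i < N; i++)
            a[i] = 1.0;
        t1 = omp_get_wtime();

        #pragma omp parallel for reduction(+:sum)   /* this region reuses the threads */
        for (i = 0; i < N; i++)
            sum += a[i];
        t2 = omp_get_wtime();

        printf("loop 1: %.6f s   loop 2: %.6f s   sum = %g\n",
               t1 - t0, t2 - t1, sum);
        return 0;
    }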

OpenMP: how reductions are implemented

We will now look at how reductions are implemented in OpenMP. Not only is this important in itself, the exercise will uncover more features of OpenMP and issues in parallel programming. The file dsum_omp_psum.c is set up to implement reductions using a technique called partial sums. Inspect this file. Whereas a normal program has a single thread of execution, when an OpenMP program executes, $OMP_NUM_THREADS threads are created. These are then activated whenever a parallel loop is executed. In this case, each thread is given a segment of the array to sum. Then, in a non-parallel loop, these partial sums are added together to get the total.

The program uses an array psums to do this. Two issues arise: how does the program determine the size of the array, and how do the threads index the array. The former can be done by a call to the OpenMP intrinsic function omp_get_max_threads(). The latter can be done by calling the intrinsic omp_get_thread_num() which returns a unique id for each thread. However, this can only be done in a (parallel) part of the program when all the threads are active!

This brings us to the concept of parallel regions. So far, a parallel region has been a simple loop, but we want each thread to get its thread id outside the loop. In C programs, a region can be defined in a code block ({ ... } ). You will see such a code block around the call omp_get_thread_num() and the subsequent for loop. Just above this block, insert the directive:

    #pragma omp parallel private(thr_id)
The variable thr_id stores the value from omp_get_thread_num(). It must be declared as a `private' variable to ensure that each thread gets its own copy (if it were a normal, shared variable, only a single value could be stored in it). You could later try removing the private(thr_id) clause and see what happens.

So far so good, but we have not actually instructed the compiler to parallelize the loop! To do so, insert the directive:

    #pragma omp for
just above the for loop. Compile the program using:
    cc -fast -xopenmp -o dsum_omp_psum dsum_omp_psum.c
Run the program in single threaded mode:
    export OMP_NUM_THREADS=1; ./dsum_omp_psum 1000000
and repeat for 2, 4 and 8 threads. Compare with dsum_omp; have we achieved the same performance?
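
For comparison, here is a minimal sketch of how the completed partial-sums code might look with both directives in place (the overall layout is an assumption; only psums, thr_id and the two directives come from the description above):
    /* Sketch of a completed partial-sums version; an assumed layout,
     * not a copy of dsum_omp_psum.c. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 1000000;
        int nthreads = omp_get_max_threads();     /* size of the psums array */
        double *a = malloc(n * sizeof(double));
        double *psums = calloc(nthreads, sizeof(double));
        double sum = 0.0;
        int i, thr_id;

        for (i = 0; i < n; i++)
            a[i] = 1.0;

        #pragma omp parallel private(thr_id)      /* parallel region */
        {
            thr_id = omp_get_thread_num();        /* unique id for this thread */
            #pragma omp for                       /* split the iterations among threads */
            for (i = 0; i < n; i++)
                psums[thr_id] += a[i];
        }

        for (i = 0; i < nthreads; i++)            /* serial combine of the partial sums */
            sum += psums[i];

        printf("sum = %f\n", sum);
        free(a);
        free(psums);
        return 0;
    }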

Programming Exercise

For SMP systems (with CPUs on separate chips and cache coherency hardware between them), performance will be highly degraded unless we pad out the psums array so that only one element is used per cache line (typically 8 words). The phenomenon is called false cache line sharing. However, because the T2 is a multicore processor with its CPUs on a single chip, this makes little difference there.

As an exercise, verify this by copying dsum_omp_psum.c to a new file dsum_omp_psum_pad.c and `pad out' the psums[] array by a factor of 8 (i.e. make it 8 times larger, and only use every 8th element). Note that the (level 2) cache line size is 64 bytes, so every element that is used will be on a separate cache line. Compile and run this program and compare it with dsum_omp_psum.
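
Only a few lines need to change relative to the sketch above; something along these lines (a fragment, not a complete program):
    /* Padding sketch: use every 8th element so that each thread's partial
     * sum sits on its own 64-byte (level 2) cache line. */
    #define PAD 8                                 /* 8 doubles * 8 bytes = 64 bytes */

    double *psums = calloc(nthreads * PAD, sizeof(double));

    /* inside the parallel region's omp for loop: */
    psums[thr_id * PAD] += a[i];

    /* serial combine: */
    for (i = 0; i < nthreads; i++)
        sum += psums[i * PAD];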

Concluding Remarks

In this session, we have looked at how the relatively simple OpenMP model is implemented using a threaded programming model, in this case Solaris threads (closely related to POSIX pthreads). As a review, consider the following questions:
  • How is OpenMP implemented: how are loops parallelized, how is the atomic directive implemented, and how are reductions implemented?

The examples have been oriented to parallelizing simple loops. But the T2 is designed for commercial applications; how are they programmed to harness concurrency? Generally, threads are programmed explicitly, for example in Java. The programming is more complex (too complex to cover in a one-hour session), but the issues of data hazards, speedups, and shared versus private data apply equally.

Extra Exercise: Atomic Operations on the SPARC

We suspect that in the mt runtime library, the atomic directives are ultimately implemented in terms of (SPARC) atomic instructions, which are used to synchronize the VCPUs on the T2. You can investigate this. First, locate the mt shared library that dsum_omp_atomic uses:
    ldd dsum_omp_atomic
You will be able to guess that it must be /lib/libmtsk.so.1. Get a disassembly of this file:
    objdump -d /lib/libmtsk.so.1 > libmtsk.dis
Now search libmtsk.dis for b_atomic to locate the function called at the start of an OpenMP atomic section of code. You will notice that it is more complex than you might expect (ask your demonstrator if you are curious, but a hint is that Solaris uses adaptive spinlocks). You will also notice that it calls a function called atomic_store. Go to the bottom of the file and search for atomic_store. You will see that it uses the swap instruction, which atomically swaps the contents of a register and a memory location (other SPARC atomic instructions are cas (compare and swap) and ldstub (load-store unsigned byte)). In the same area, you will see other functions that perform primitive synchronization operations, such as the atomic decrement of a memory location.

If you repeat this exercise for the function that is called when you end an atomic region (search for e_atomic), you will see that it similarly uses the atomic_store function.
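
As a conceptual illustration of how an atomic exchange instruction (such as swap or ldstub) can be used to build a lock, here is a minimal test-and-set spinlock. It is written with the standard C11 <stdatomic.h> interface rather than SPARC assembly or the Sun compiler used in this lab, and it is not the libmtsk implementation, which layers adaptive behaviour on top of this idea:
    /* Conceptual sketch: a test-and-set spinlock built on an atomic exchange.
     * Uses C11 <stdatomic.h>; this is NOT the libmtsk code. */
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;
    static long counter = 0;

    static void lock_acquire(void)
    {
        /* Atomically set the flag and get its previous value; if it was
         * already set, another thread holds the lock, so spin and retry. */
        while (atomic_flag_test_and_set(&lock))
            ;                                   /* busy-wait */
    }

    static void lock_release(void)
    {
        atomic_flag_clear(&lock);               /* cf. the atomic_store function */
    }

    int main(void)
    {
        lock_acquire();
        counter++;                              /* critical section */
        lock_release();
        printf("counter = %ld\n", counter);
        return 0;
    }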

Last modified: 3/08/2011, 11:27
