ANU College of Engineering and Computer Science
School of Computer Science
COMP8320 Laboratory 02 - week 2, 2011
OpenMP on Solaris

In this session, we will look at how OpenMP is implemented on SPARC/Solaris. It will also serve as a catch-up from the previous session. You will encounter issues, some quite deep, relating to multicore programming and performance. These notes will ask you questions about them as they come up. It is good to think about them, but as time is limited, quickly check your understanding with the demonstrator and then move on.

Logging in and Setting Up

Log in to wallaman. You should see a prompt like:
Complete Laboratory 01

If you have not done so, complete the previous week's lab.

OpenMP: How the atomic directive is implemented

Recall that we protected the update of the variable sum in dsum_omp_atomic.c by adding the line:
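In a standard OpenMP program this is simply the atomic directive, placed immediately before the update of sum; the statement shown with it below is indicative only and may differ from the one in dsum_omp_atomic.c:

    #pragma omp atomic
    sum += a[i];    /* the update being protected; variable names assumed */

Time the atomic version as in the previous lab: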
./threadrun 32 ./dsum_omp_atomic 100000
OpenMP: how loops are parallelized

Ask the compiler to produce an assembly listing of its compilation of dsum_omp:
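Assuming the Sun Studio cc compiler and the OpenMP flags used in Laboratory 01, an invocation along the following lines will leave the listing in dsum_omp.s (match the optimization options to those in your makefile):

cc -xO3 -xopenmp -S dsum_omp.c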
Now locate the entry point of main() (near the top of the file). Search for consecutive call instructions. You will see a call to the master function __mt_MasterFunction_rtc_(); go past this until you find the second one. This is for the second loop; you will see (a number of instructions up) that the address of _$d1B30.main is being placed on the stack for this function to use.

So how does this work? The first call to the master function creates the threads and sets them to execute the function for the first parallel loop. The threads idle between this and the second call, which causes them to wake up and execute the function for the second loop. You can verify that the first call creates the threads, and determine the overhead of thread creation, by removing the first #pragma omp and seeing how that affects the execution time of the second loop.

OpenMP: how reductions are implemented

We will now look at how reductions are implemented in OpenMP. Not only is this important in itself, the exercise will also uncover more features of OpenMP and issues in parallel programming. The file dsum_omp_psum.c is set up to implement reductions using a technique called partial sums. Inspect this file.

Instead of the single thread of execution in a normal program, when an OpenMP program executes, $OMP_NUM_THREADS threads are created. These are then activated whenever a parallel loop is executed. In this case, each thread is given a segment of the array to sum. Then, in a non-parallel loop, these partial sums are added together to give the total.

The program uses an array psums to do this. Two issues arise: how does the program determine the size of the array, and how do the threads index it? The former can be done by a call to the OpenMP intrinsic function omp_get_max_threads(). The latter can be done by calling the intrinsic omp_get_thread_num(), which returns a unique id for each thread. However, this can only be done in a (parallel) part of the program where all the threads are active! This brings us to the concept of parallel regions. So far, a parallel region has been a simple loop, but here we want each thread to obtain its thread id outside the loop. In C programs, a region can be defined over a code block ({ ... }). You will see such a code block around the call to omp_get_thread_num() and the subsequent for loop. Just above this block, insert the directive:
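In standard OpenMP this is the parallel directive, which makes the block that follows it a parallel region (the lab file may also call for data-scoping clauses such as private or shared; ask the demonstrator if unsure):

#pragma omp parallel

With only this directive every thread executes the whole block, so the iterations of the loop are not yet divided between the threads.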
So far so good, but we have not actually instructed the compiler to parallelize the loop! To do so, insert the directive:
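Inside a parallel region, the iterations of a loop are shared among the threads by placing the work-sharing for directive immediately before the loop:

#pragma omp for

Putting the pieces together, the partial-sums technique has roughly the following shape. This is a minimal sketch only: the data, its initialization and the variable names will differ from those in dsum_omp_psum.c.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 100000

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *psums;
        double sum = 0.0;
        int nthreads, i;

        for (i = 0; i < N; i++)            /* hypothetical test data */
            a[i] = 1.0 / (i + 1);

        nthreads = omp_get_max_threads();  /* how many threads can be active */
        psums = calloc(nthreads, sizeof(double));

        #pragma omp parallel               /* all threads execute this block */
        {
            int id = omp_get_thread_num(); /* unique id: 0 .. nthreads-1 */
            #pragma omp for                /* divide the iterations among the threads */
            for (i = 0; i < N; i++)
                psums[id] += a[i];
        }

        for (i = 0; i < nthreads; i++)     /* serial loop: combine the partial sums */
            sum += psums[i];

        printf("sum = %f\n", sum);
        free(a);
        free(psums);
        return 0;
    }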
Programming Exercise

For SMP systems (with CPUs on separate chips and cache-coherency hardware between them), performance will be highly degraded unless we pad out the psums array so that only one element is used per cache line (typically 8 words). This phenomenon is called false cache line sharing. However, because the T2 is a multicore processor with its CPUs on a single chip, this makes little difference there.

As an exercise, verify this by copying dsum_omp_psum.c to a new file dsum_omp_psum_pad.c and `pad out' the psums[] array by a factor of 8 (i.e. make it 8 times larger, and use only every 8th element). Note that the (level 2) cache line size is 64 bytes, so every element that is used will be on a separate cache line. Compile and run this program and compare it with dsum_omp_psum.

Concluding Remarks

In this session, we have looked at how the relatively simple OpenMP model is implemented using a threaded programming model, in this case Solaris threads (closely related to POSIX pthreads). As a review, consider the following questions:
The examples have been oriented to parallelizing simple loops. But the T2 is designed for commercial applications; how are they programmed to harness concurrency? Generally, threads are explicitly programmed, for example in Java. The programming is more complex, too complex to cover in a one-hour session, but the issues of data hazards, speedups, and shared and private data apply equally.

Extra Exercise: Atomic Operations on the SPARC

We have suspected that, in the mt runtime library, the atomic directives are ultimately implemented in terms of (SPARC) atomic instructions, which are used to synchronize the VCPUs on the T2. You can investigate this. First, locate the mt shared library that dsum_omp_atomic uses:
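One way to do this, assuming the standard Solaris ldd utility, is to list the shared objects the executable is linked against and look for the mt runtime (libmtsk.so or similarly named):

ldd dsum_omp_atomic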
If you repeat this exercise for the function that is called when you end an atomic region (search for e_atomic), you will see that it similarly uses the atomic_store function.
Last modified: 3/08/2011, 11:27